Re: Problem using limit clause in spark sql

2015-12-26 Thread tiandiwoxin1234
As for 'rdd.zipWithIndex.partitionBy(YourCustomPartitioner)': can I actually drop 
records inside my custom partitioner? Otherwise I still have to call 
rdd.take(10000) to get exactly 10,000 records.

And repartition is precisely the expensive operation that I want to work around.

Actually, what I expect the limit clause to do is use some kind of coordinator to 
assign each partition a number of records to keep, with the quotas summing to 
exactly the limit (10,000 in this case). But it seems this cannot be easily done.
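For what it is worth, that coordinator can be hand-rolled on the underlying RDD. The sketch below is only illustrative (the function name, the quota scheme and the broadcast are mine, not something Spark provides for this): the driver first collects per-partition counts, hands each partition a quota so the quotas sum to the limit, and each partition then materializes only its quota, so the result stays spread over the original partitions instead of collapsing into one.

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Sketch: an exact, distributed "take n" without the single-partition collapse.
def takeDistributed[T: ClassTag](rdd: RDD[T], n: Long): RDD[T] = {
  // Pass 1: the driver learns how many records each partition holds.
  val counts: Array[Long] =
    rdd.mapPartitions(it => Iterator.single(it.size.toLong)).collect()

  // Driver-side coordination: assign quotas that sum to n (or to the total, if smaller).
  var remaining = n
  val quotas = counts.map { c =>
    val q = math.min(c, remaining)
    remaining -= q
    q
  }

  // Pass 2: each partition keeps only its quota; no shuffle, no single partition.
  val quotasBc = rdd.sparkContext.broadcast(quotas)
  rdd.mapPartitionsWithIndex((idx, it) => it.take(quotasBc.value(idx).toInt))
}

// e.g. takeDistributed(sqlContext.sql("select * from table").rdd, 10000L)

The price is the extra counting pass over the data, which is essentially the trade-off Zhan describes further down in the thread.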

> On Dec 25, 2015, at 11:10 PM, manasdebashiskar [via Apache Spark User List] 
> <ml-node+s1001560n25797...@n3.nabble.com> wrote:
> 
> It can be easily done using an RDD. 
> 
> rdd.zipWithIndex.partitionBy(YourCustomPartitioner) should give you your 
> items. 
> Here YourCustomPartitioner will know how to pick sample items from each 
> partition. 
> 
> If you want to stick to DataFrame you can always repartition the data after 
> you apply the limit. 
> 
> ..Manas 
> 





Re: Problem using limit clause in spark sql

2015-12-25 Thread manasdebashiskar
It can be easily done using an RDD.

rdd.zipWithIndex.partitionBy(YourCustomPartitioner) should give you your
items.
Here YourCustomPartitioner will know how to pick sample items from each
partition.

If you want to stick to DataFrame you can always repartition the data after
you apply the limit.

..Manas
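
One concrete way to act on the zipWithIndex idea, without even writing the custom partitioner, is to keep only the first N global indices. This is just a sketch of that approach (the table name is a placeholder, and it is not necessarily the exact partitioner scheme meant above):

val n = 10000L
val firstN = sqlContext.sql("select * from table").rdd
  .zipWithIndex()                        // attach a stable global index to every row
  .filter { case (_, idx) => idx < n }   // keep the first n rows, wherever they live
  .map(_._1)                             // drop the index again

zipWithIndex itself runs a small job to learn the partition sizes, so this is still a two-pass operation, but the surviving rows stay spread over their original partitions and the expensive map can then run in parallel.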






Problem using limit clause in spark sql

2015-12-23 Thread tiandiwoxin1234
Hi,
I am using spark sql in a way like this:

sqlContext.sql("select * from table limit 10000").map(...).collect()

The problem is that the limit clause will collect all 10,000 records into a
single partition, so the map afterwards runs in only one partition and is
really slow. I tried to use repartition, but it is kind of a waste to collect
all those records into one partition, shuffle them around, and then collect
them again.

Is there a way to work around this? 
BTW, there is no order by clause and I do not care which 10,000 records I get,
as long as the total number is less than or equal to 10,000.
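
For reference, the pattern described above looks roughly like this (the identity map stands in for the real per-row work, and the partition count of 8 in the last line is an arbitrary example value):

val limited = sqlContext.sql("select * from table limit 10000")
// The global limit is satisfied by pulling everything into one partition,
// so the map below runs as a single task.
println(limited.rdd.partitions.length)        // typically 1, per the behaviour described above
val slow = limited.map(row => row).collect()

// The repartition work-around: spreads the rows back out, but only after an
// extra shuffle of all 10,000 of them.
val stillWasteful = limited.repartition(8).map(row => row).collect()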







Re: Problem using limit clause in spark sql

2015-12-23 Thread Zhan Zhang
There has to be a central point for collaboratively collecting exactly 10,000 
records; currently the approach is to use one single partition, which is easy 
to implement.
Otherwise, the driver has to count the number of records in each partition and 
then decide how many records to materialize in each partition, because some 
partitions may not have enough records, and some may even be empty.

I didn’t see any straightforward work-around for this.

Thanks.

Zhan Zhang
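
If exactly 10,000 is not actually required (the original question only asks for at most 10,000, any rows), a cheap single-pass compromise is to cap every partition at an equal share of the limit. It never exceeds the limit and needs neither a shuffle nor a counting pass, but it can return fewer rows when partitions are short or empty, which is exactly the imbalance described above. A sketch with illustrative names, assuming the RDD has no more than 10,000 partitions:

val rdd = sqlContext.sql("select * from table").rdd
val n = 10000
val share = n / rdd.partitions.length   // per-partition quota
// Each partition keeps at most its share, so the total never exceeds n;
// partitions with fewer rows than the share simply contribute what they have.
val atMostN = rdd.mapPartitions(it => it.take(share), preservesPartitioning = true)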



On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote:

It is an application running as an http server. So I collect the data as the 
response.

On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote:

When you call collect() it will bring all the data to the driver. Do you mean 
to call persist() instead?


From: tiandiwo...@icloud.com
Subject: Problem using limit clause in spark sql
Date: Wed, 23 Dec 2015 21:26:51 +0800
To: user@spark.apache.org

Hi,
I am using spark sql in a way like this:

sqlContext.sql("select * from table limit 10000").map(...).collect()

The problem is that the limit clause will collect all 10,000 records into a 
single partition, so the map afterwards runs in only one partition and is 
really slow. I tried to use repartition, but it is kind of a waste to collect 
all those records into one partition, shuffle them around, and then collect 
them again.

Is there a way to work around this?
BTW, there is no order by clause and I do not care which 10,000 records I get, 
as long as the total number is less than or equal to 10,000.




Re: Problem using limit clause in spark sql

2015-12-23 Thread 汪洋
I see.  

Thanks.


> On Dec 24, 2015, at 11:44 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:
> 
> There has to be a central point for collaboratively collecting exactly 10,000 
> records; currently the approach is to use one single partition, which is easy 
> to implement. 
> Otherwise, the driver has to count the number of records in each partition 
> and then decide how many records to materialize in each partition, because 
> some partitions may not have enough records, and some may even be empty.
> 
> I didn’t see any straightforward work-around for this.
> 
> Thanks.
> 
> Zhan Zhang
> 
> 
> 
> On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
> 
>> It is an application running as an http server. So I collect the data as the 
>> response.
>> 
>>> On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote:
>>> 
>>> When you call collect() it will bring all the data to the driver. Do you 
>>> mean to call persist() instead?
>>> 
>>> From: tiandiwo...@icloud.com
>>> Subject: Problem using limit clause in spark sql
>>> Date: Wed, 23 Dec 2015 21:26:51 +0800
>>> To: user@spark.apache.org
>>> 
>>> Hi,
>>> I am using spark sql in a way like this:
>>> 
>>> sqlContext.sql("select * from table limit 10000").map(...).collect()
>>> 
>>> The problem is that the limit clause will collect all 10,000 records 
>>> into a single partition, so the map afterwards runs in only one 
>>> partition and is really slow. I tried to use repartition, but it is kind 
>>> of a waste to collect all those records into one partition, shuffle them 
>>> around, and then collect them again.
>>> 
>>> Is there a way to work around this? 
>>> BTW, there is no order by clause and I do not care which 10,000 records I 
>>> get, as long as the total number is less than or equal to 10,000.
>> 
> 



Re: Problem using limit clause in spark sql

2015-12-23 Thread Gaurav Agarwal
Suppose I have the above scenario but without using the limit clause; will the
work then be spread across all the partitions?
On Dec 24, 2015 9:26 AM, "汪洋" <tiandiwo...@icloud.com> wrote:

> I see.
>
> Thanks.
>
>
> On Dec 24, 2015, at 11:44 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:
>
> There has to be a central point for collaboratively collecting exactly
> 10,000 records; currently the approach is to use one single partition, which
> is easy to implement.
> Otherwise, the driver has to count the number of records in each partition
> and then decide how many records to materialize in each partition, because
> some partitions may not have enough records, and some may even be empty.
>
> I didn’t see any straightforward work-around for this.
>
> Thanks.
>
> Zhan Zhang
>
>
>
> On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
>
> It is an application running as an http server. So I collect the data as
> the response.
>
> On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote:
>
> When you call collect() it will bring all the data to the driver. Do you
> mean to call persist() instead?
>
> --
> From: tiandiwo...@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
>
> Hi,
> I am using spark sql in a way like this:
>
> sqlContext.sql("select * from table limit 10000").map(...).collect()
>
> The problem is that the limit clause will collect all 10,000 records into
> a single partition, so the map afterwards runs in only one partition and
> is really slow. I tried to use repartition, but it is kind of a waste to
> collect all those records into one partition, shuffle them around, and
> then collect them again.
>
> Is there a way to work around this?
> BTW, there is no order by clause and I do not care which 10,000 records I
> get, as long as the total number is less than or equal to 10,000.
>
>
>
>
>


Re: Problem using limit clause in spark sql

2015-12-23 Thread 汪洋
It is an application running as an http server. So I collect the data as the 
response.

> On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote:
> 
> When you call collect() it will bring all the data to the driver. Do you mean 
> to call persist() instead?
> 
> From: tiandiwo...@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
> 
> Hi,
> I am using spark sql in a way like this:
> 
> sqlContext.sql("select * from table limit 10000").map(...).collect()
> 
> The problem is that the limit clause will collect all 10,000 records into 
> a single partition, so the map afterwards runs in only one partition and 
> is really slow. I tried to use repartition, but it is kind of a waste to 
> collect all those records into one partition, shuffle them around, and 
> then collect them again.
> 
> Is there a way to work around this? 
> BTW, there is no order by clause and I do not care which 10,000 records I 
> get, as long as the total number is less than or equal to 10,000.