Re: Spark Data Frame. PreSorted partitions

2017-11-28 Thread Michael Artz
I'm not sure, other than retrieving from a Hive table that is already
sorted. This sounds cool though; I would be interested to know this as well.

On Nov 28, 2017 10:40 AM, "Николай Ижиков"  wrote:

> Hello, guys!
>
> I am working on an implementation of a custom DataSource for the Spark Data
> Frame API and have a question:
>
> If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort
> the data inside a partition in my data source.
>
> Is there a built-in option to tell Spark that the data from each partition
> is already sorted?
>
> It seems that Spark could benefit from already-sorted partitions, for
> example by using a distributed merge-sort algorithm.
>
> Does that make sense to you?
>
>
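
For context, a minimal sketch of the distinction the question is about, using only the public
DataFrame API (`df` and `some_column` are just placeholders taken from the quoted query); whether
a custom data source can declare its partitions as pre-sorted is a separate question that the
thread leaves open:

// sortWithinPartitions orders rows inside each partition without a shuffle;
// orderBy performs a global sort (a range-partitioning shuffle plus a per-partition sort).
val perPartition = df.sortWithinPartitions("some_column")
val globalOrder  = df.orderBy("some_column")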


Re: Spark Data Frame Writer - Range Partitioning

2017-07-25 Thread Jain, Nishit
But wouldn’t a partitioning column partition the data only in the Spark RDD? Would it
also partition the data on disk when it is written (dividing the data into folders)?

From: ayan guha <guha.a...@gmail.com>
Date: Friday, July 21, 2017 at 3:25 PM
To: "Jain, Nishit" <nja...@underarmour.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark Data Frame Writer - Range Partitioning

How about creating a partition column and using it?

On Sat, 22 Jul 2017 at 2:47 am, Jain, Nishit <nja...@underarmour.com> wrote:

Is it possible to have Spark Data Frame Writer write based on RangePartitioning?

For Ex -

I have 10 distinct values for column_a, say 1 to 10.

df.write
.partitionBy("column_a")


The above code will by default create 10 folders: column_a=1, column_a=2,
..., column_a=10.

I want to see if it is possible to have these partitions based on buckets instead,
e.g. col_a=1to5, col_a=5-10, or something like that. And then also have the query
engine respect it.

Thanks,

Nishit

--
Best Regards,
Ayan Guha
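
A minimal sketch of the "partition column" suggestion above, assuming a derived bucket column
(the col_a_bucket name, bucket boundaries, and output path are hypothetical). DataFrameWriter's
partitionBy splits the written data into one folder per distinct value of that column, so the
folders on disk follow the buckets rather than the raw values:

import org.apache.spark.sql.functions.{col, when}

// Derive a coarse bucket from column_a, then partition the output by it.
// Each distinct bucket value becomes a folder such as col_a_bucket=1to5.
val bucketed = df.withColumn(
  "col_a_bucket",
  when(col("column_a") <= 5, "1to5").otherwise("6to10"))

bucketed.write
  .partitionBy("col_a_bucket")
  .parquet("/tmp/output")  // hypothetical path and format

Whether a query engine then prunes on those folders depends on filtering by col_a_bucket itself;
a predicate on column_a alone is not automatically rewritten into bucket pruning.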


Re: Spark Data Frame Writer - Range Partitioning

2017-07-21 Thread ayan guha
How about creating a partition column and using it?

On Sat, 22 Jul 2017 at 2:47 am, Jain, Nishit  wrote:

> Is it possible to have Spark Data Frame Writer write based on
> RangePartitioning?
>
> For Ex -
>
> I have 10 distinct values for column_a, say 1 to 10.
>
> df.write
> .partitionBy("column_a")
>
> The above code will by default create 10 folders: column_a=1, column_a=2,
> ..., column_a=10.
>
> I want to see if it is possible to have these partitions based on buckets
> instead, e.g. col_a=1to5, col_a=5-10, or something like that. And then also
> have the query engine respect it.
>
> Thanks,
>
> Nishit
>
-- 
Best Regards,
Ayan Guha


Re: Spark data frame map problem

2017-03-22 Thread Yan Facai
Could you give more details of your code?



On Wed, Mar 22, 2017 at 2:40 AM, Shashank Mandil 
wrote:

> Hi All,
>
> I have a spark data frame which has 992 rows inside it.
> When I run a map on this data frame I expect that the map should work for
> all the 992 rows.
>
> Since the mapper runs on an executor in a cluster, I did a distributed count
> of the number of rows the mapper is being run on.
>
> dataframe.map(r => {
>//distributed count inside here using zookeeper
> })
>
> I have found that this distributed count inside the mapper is not exactly
> 992. I have found this number to vary with different runs.
>
> Does anybody have any idea what might be happening? By the way, I am
> using Spark 1.6.1.
>
> Thanks,
> Shashank
>
>
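
One likely explanation, with a small hedged sketch (Spark 1.6-era API, sc and dataframe as in the
question): transformations such as map are lazy and their tasks can be retried or speculatively
re-executed, so side effects inside them may fire a different number of times than there are rows.
An accumulator updated inside an action, or a plain count(), is the usual way to count rows:

// Count rows with an accumulator inside an action rather than an external store.
// Spark applies accumulator updates made inside an action once per successful task;
// updates made inside transformations (such as map) can be re-applied on retries,
// which is one reason counts taken from side effects inside map() can drift.
val rowCounter = sc.accumulator(0L, "rows seen")

dataframe.foreach { _ =>
  rowCounter += 1L
}

println(s"accumulator: ${rowCounter.value}, count(): ${dataframe.count()}")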


Re: Spark data frame

2015-12-22 Thread Dean Wampler
More specifically, you could have TBs of data across thousands of
partitions for a single RDD. If you call collect(), BOOM!

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Dec 22, 2015 at 4:20 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Michael,
>
> collect will bring down the results to the driver JVM, whereas the RDD or
> DataFrame would be cached on the executors (if it is cached). So, as Dean
> said, the driver JVM needs to have enough memory to store the results of
> collect.
>
> Thanks,
> Silvio
>
> From: Michael Segel 
> Date: Tuesday, December 22, 2015 at 4:26 PM
> To: Dean Wampler 
> Cc: Gaurav Agarwal , "user@spark.apache.org" <
> user@spark.apache.org>
> Subject: Re: Spark data frame
>
> Dean,
>
> The RDD is in memory, and then collect() produces a collection, so both
> are alive at the same time.
> (Again, not sure how Tungsten plays into this… )
>
> So his collection can’t be larger than 1/2 of the memory allocated to the
> heap.
>
> (Unless you have allocated swap…, right?)
>
> On Dec 22, 2015, at 12:11 PM, Dean Wampler  wrote:
>
> You can call the collect() method to return a collection, but be careful.
> If your data is too big to fit in the driver's memory, it will crash.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com/>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Tue, Dec 22, 2015 at 1:09 PM, Gaurav Agarwal 
> wrote:
>
>> We are able to retrieve the data frame by filtering the RDD object. I need
>> to convert that data frame into a Java POJO. Any idea how to do that?
>>
>
>
>
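
A short sketch of the usual alternatives when the full result may not fit in the driver JVM
(the row limit and output path below are hypothetical):

// Bring back only a bounded number of rows.
val sample = df.take(1000)

// Or iterate over the result one partition at a time; each partition still has
// to fit in driver memory, but the whole DataFrame never does at once.
val iter = df.rdd.toLocalIterator

// Or keep the result distributed and write it out instead of collecting it.
df.write.parquet("/tmp/results")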


Re: Spark data frame

2015-12-22 Thread Silvio Fiorito
Michael,

collect will bring down the results to the driver JVM, whereas the RDD or 
DataFrame would be cached on the executors (if it is cached). So, as Dean said, 
the driver JVM needs to have enough memory to store the results of collect.

Thanks,
Silvio

From: Michael Segel <msegel_had...@hotmail.com>
Date: Tuesday, December 22, 2015 at 4:26 PM
To: Dean Wampler <deanwamp...@gmail.com>
Cc: Gaurav Agarwal <gaurav130...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark data frame

Dean,

The RDD is in memory, and then collect() produces a collection, so both are
alive at the same time.
(Again, not sure how Tungsten plays into this… )

So his collection can’t be larger than 1/2 of the memory allocated to the heap.

(Unless you have allocated swap…, right?)

On Dec 22, 2015, at 12:11 PM, Dean Wampler <deanwamp...@gmail.com> wrote:

You can call the collect() method to return a collection, but be careful. If 
your data is too big to fit in the driver's memory, it will crash.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Dec 22, 2015 at 1:09 PM, Gaurav Agarwal <gaurav130...@gmail.com> wrote:

We are able to retrieve the data frame by filtering the RDD object. I need to
convert that data frame into a Java POJO. Any idea how to do that?




Re: Spark data frame

2015-12-22 Thread Dean Wampler
You can call the collect() method to return a collection, but be careful.
If your data is too big to fit in the driver's memory, it will crash.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Dec 22, 2015 at 1:09 PM, Gaurav Agarwal 
wrote:

> We are able to retrieve the data frame by filtering the RDD object. I need to
> convert that data frame into a Java POJO. Any idea how to do that?
>
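
On the original question of converting rows to POJOs, a hypothetical sketch using a Scala case
class as a stand-in for a Java POJO (the Person class and its fields are invented for
illustration and are not from the thread):

// Assume the DataFrame has "name" (string) and "age" (int) columns.
case class Person(name: String, age: Int)

val people = df.rdd.map { row =>
  Person(row.getAs[String]("name"), row.getAs[Int]("age"))
}

// people is an RDD[Person]; collect() it only if it fits in driver memory,
// as discussed above.
val localPeople: Array[Person] = people.collect()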


Re: spark data frame write.mode("append") bug

2015-12-12 Thread Michael Armbrust
If you want to contribute to the project, open a JIRA/PR:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Sat, Dec 12, 2015 at 3:13 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
>
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48
>
> In the present Spark version there is a bug at line 48: checking whether a
> table exists in a database by using LIMIT does not work for all databases
> (SQL Server, for example).
>
> The best way to check whether a table exists in any database is to use
> select * from table where 1=2; or select 1 from table where 1=2; these work
> on all databases.
>
> Can this change be implemented in Spark 1.6? It would make the
> write.mode("append") bug go away.
>
>
>
> def tableExists(conn: Connection, table: String): Boolean = {
>   // Somewhat hacky, but there isn't a good way to identify whether a table
>   // exists for all SQL database systems, considering "table" could also
>   // include the database name.
>   Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
> }
>
> Solution:
> def tableExists(conn: Connection, table: String): Boolean = {
>   // Somewhat hacky, but there isn't a good way to identify whether a table
>   // exists for all SQL database systems, considering "table" could also
>   // include the database name.
>   Try(conn.prepareStatement(s"SELECT 1 FROM $table where 1=2").executeQuery().next()).isSuccess
> }
>
>
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-data-frame-write-mode-append-bug-tp25650p25693.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: spark data frame write.mode("append") bug

2015-12-12 Thread sri hari kali charan Tummala
Hi All,

https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48

In the present Spark version there is a bug at line 48: checking whether a table
exists in a database by using LIMIT does not work for all databases (SQL Server,
for example).

The best way to check whether a table exists in any database is to use
select * from table where 1=2; or select 1 from table where 1=2; these work on
all databases.

Can this change be implemented in Spark 1.6? It would make the
write.mode("append") bug go away.



def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table
  // exists for all SQL database systems, considering "table" could also
  // include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
}

Solution:
def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table
  // exists for all SQL database systems, considering "table" could also
  // include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table where 1=2").executeQuery().next()).isSuccess
}



Thanks

On Wed, Dec 9, 2015 at 4:24 PM, Seongduk Cheon  wrote:

> Not sure, but I think it is a bug as of 1.5.
>
> Spark uses the LIMIT keyword to check whether a table exists:
>
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48
>
> If your database does not support the LIMIT keyword (SQL Server, for example),
> Spark tries to create the table:
>
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L272-L275
>
> This issue has already been fixed and will be released in 1.6:
> https://issues.apache.org/jira/browse/SPARK-9078
>
>
> --
> Cheon
>
> 2015-12-09 22:54 GMT+09:00 kali.tumm...@gmail.com 
> :
>
>> Hi Spark Contributors,
>>
>> I am trying to append data to a target table using the
>> df.write.mode("append") functionality, but Spark throws a "table already
>> exists" exception.
>>
>> Is there a fix scheduled in a later Spark release? I am using Spark 1.5.
>>
>> val sourcedfmode=sourcedf.write.mode("append")
>> sourcedfmode.jdbc(TargetDBinfo.url,TargetDBinfo.table,targetprops)
>>
>> Full Code:-
>>
>> https://github.com/kali786516/ScalaDB/blob/master/src/main/java/com/kali/db/SaprkSourceToTargetBulkLoad.scala
>>
>> Spring Config File:-
>>
>> https://github.com/kali786516/ScalaDB/blob/master/src/main/resources/SourceToTargetBulkLoad.xml
>>
>>
>> Thanks
>> Sri
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-data-frame-write-mode-append-bug-tp25650.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>
>


-- 
Thanks & Regards
Sri Tummala


Re: spark data frame write.mode("append") bug

2015-12-12 Thread kali.tumm...@gmail.com
Hi All, 

https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48

In the present Spark version there is a bug at line 48: checking whether a table
exists in a database by using LIMIT does not work for all databases (SQL Server,
for example).

The best way to check whether a table exists in any database is to use
select * from table where 1=2; or select 1 from table where 1=2; these work on
all databases.

Can this change be implemented in Spark 1.6? It would make the
write.mode("append") bug go away.



def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table
  // exists for all SQL database systems, considering "table" could also
  // include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
}

Solution:
def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table
  // exists for all SQL database systems, considering "table" could also
  // include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table where 1=2").executeQuery().next()).isSuccess
}



Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-data-frame-write-mode-append-bug-tp25650p25693.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: spark data frame write.mode("append") bug

2015-12-09 Thread Seongduk Cheon
Not sure, but I think it is a bug as of 1.5.

Spark uses the LIMIT keyword to check whether a table exists:
https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48

If your database does not support the LIMIT keyword (SQL Server, for example),
Spark tries to create the table:
https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L272-L275

This issue has already been fixed and will be released in 1.6:
https://issues.apache.org/jira/browse/SPARK-9078


--
Cheon

2015-12-09 22:54 GMT+09:00 kali.tumm...@gmail.com :

> Hi Spark Contributors,
>
> I am trying to append data to a target table using the
> df.write.mode("append") functionality, but Spark throws a "table already
> exists" exception.
>
> Is there a fix scheduled in a later Spark release? I am using Spark 1.5.
>
> val sourcedfmode=sourcedf.write.mode("append")
> sourcedfmode.jdbc(TargetDBinfo.url,TargetDBinfo.table,targetprops)
>
> Full Code:-
>
> https://github.com/kali786516/ScalaDB/blob/master/src/main/java/com/kali/db/SaprkSourceToTargetBulkLoad.scala
>
> Spring Config File:-
>
> https://github.com/kali786516/ScalaDB/blob/master/src/main/resources/SourceToTargetBulkLoad.xml
>
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-data-frame-write-mode-append-bug-tp25650.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
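
For reference, a hedged sketch of the extension point the SPARK-9078 fix is built around, assuming
Spark 1.6+ where JdbcDialect exposes a getTableExistsQuery hook: a registered dialect lets the
existence check avoid LIMIT on databases such as SQL Server. The dialect below is illustrative
only; as of the fix, the default probe already takes the WHERE 1=0 form.

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative dialect that overrides the table-existence probe so it does not
// rely on LIMIT. WHERE 1=0 returns no rows but still fails if the table is
// missing, which is all the check needs.
object NoLimitExistsDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:sqlserver")

  override def getTableExistsQuery(table: String): String =
    s"SELECT * FROM $table WHERE 1=0"
}

JdbcDialects.registerDialect(NoLimitExistsDialect)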