Re: Fastest way to drop useless columns
I believe this only works when we need to drop duplicate ROWS. Here I want to drop columns which contain only one unique value.

On 2018-05-31 11:16, Divya Gehlot wrote:
You can try the dropDuplicates function:
https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala

On 31 May 2018 at 16:34, wrote:
Hi there! I have a potentially large dataset (in terms of number of rows and columns), and I want to find the fastest way to drop the columns that are useless to me, i.e. columns containing only a single unique value. I would like to know what you think I could do to achieve this as fast as possible using Spark. I already have a solution using distinct().count() or approxCountDistinct(), but these may not be the best choice, since they require going through all the data even if the first 2 tested values of a column already differ (in which case I know I can keep the column).
Thanks for your ideas!
Julien
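A minimal sketch of what the poster describes: compute an approximate distinct count for every column in a single aggregation pass, then drop the columns that hold only one value. The DataFrame name `df` and the 5% relative error are assumptions for illustration, not details from the thread:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.approx_count_distinct

// One job over the data: an approximate distinct count per column,
// then drop every column whose count is at most 1.
def dropConstantColumns(df: DataFrame): DataFrame = {
  val counts = df
    .select(df.columns.map(c => approx_count_distinct(c, 0.05).as(c)): _*) // 5% relative error
    .head()
  val constantCols = df.columns.filter(c => counts.getAs[Long](c) <= 1L)
  df.drop(constantCols: _*)
}

This still scans all the data once, but it does so for all columns together instead of running one distinct() per column.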
Fastest way to drop useless columns
Hi there! I have a potentially large dataset (in terms of number of rows and columns), and I want to find the fastest way to drop the columns that are useless to me, i.e. columns containing only a single unique value. I would like to know what you think I could do to achieve this as fast as possible using Spark. I already have a solution using distinct().count() or approxCountDistinct(), but these may not be the best choice, since they require going through all the data even if the first 2 tested values of a column already differ (in which case I know I can keep the column).
Thanks for your ideas!
Julien
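The "stop as soon as two values differ" idea does not map directly onto a DataFrame aggregation (an aggregation scans its input), but a two-phase variant gets close in spirit: decide cheaply on a small sample which columns are clearly non-constant, and only run the exact check on the remaining candidates. A hedged sketch under those assumptions; `df`, the sample fraction and the helper name are mine, not from the thread:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.countDistinct

// Phase 1: any column showing at least 2 distinct values in a small sample
// is kept without ever being scanned in full.
// Phase 2: only the remaining candidates get an exact distinct count.
def dropConstantColumnsTwoPhase(df: DataFrame, fraction: Double = 0.01): DataFrame = {
  val sample = df.sample(withReplacement = false, fraction).cache()
  val candidates = df.columns.filter { c =>
    sample.select(c).distinct().take(2).length <= 1
  }
  sample.unpersist()

  val constantCols =
    if (candidates.isEmpty) Array.empty[String]
    else {
      val counts = df
        .select(candidates.map(c => countDistinct(df(c)).as(c)): _*)
        .head()
      candidates.filter(c => counts.getAs[Long](c) <= 1L)
    }
  df.drop(constantCols: _*)
}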
Feature generation / aggregate functions / timeseries
Hi dear Spark community!

I want to create a lib which generates features for potentially very large datasets, so I believe Spark could be a nice tool for that. Let me explain what I need to do.

Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (generally a double)

I want my tool to compute aggregate functions for many pairs 'instant + duration'.
FOR EXAMPLE: compute, for the instant 't = 2001-01-01', aggregate functions over the data between 't-1month and t' and between 't-12months and t-9months', and this FOR EACH ID! (aggregate functions such as min/max/count/distinct/last/mode/kurtosis... or even user defined!)

My constraints:
- I don't want to compute aggregates for each tuple of 'F': I want to provide a list of pairs 'instant + duration' (potentially large)
- My 'window' defined by the duration may be really large (but may contain only a few values...)
- I may have many ids...
- I may have many timestamps...

Let me describe this with some kind of example to see if SPARK (SPARK STREAMING?) may help me to do that. Let's imagine that I have all my data in a DB or a file with the following columns:

id | timestamp(ms) | value
A  | 100           | 100
A  | 1000500       | 66
B  | 100           | 100
B  | 110           | 50
B  | 120           | 200
B  | 250           | 500

(The timestamp is a long value, so as to be able to express dates in ms from -01-01 to today.)

I want to compute operations such as min, max, average, last on the value column, for these pairs:
-> instant = 1000500 / [-1000ms, 0] (i.e. aggregate data between [t-1000ms and t])
-> instant = 133 / [-5000ms, -2500] (i.e. aggregate data between [t-5000ms and t-2500ms])

And this will produce this kind of output:

id | timestamp(ms) | min_value | max_value | avg_value | last_value
A  | 1000500       | min...    | max       | avg       | last
B  | 1000500       | min...    | max       | avg       | last
A  | 133           | min...    | max       | avg       | last
B  | 133           | min...    | max       | avg       | last

Do you think we can do this efficiently with Spark and/or Spark Streaming, and do you have an idea on "how"? (I have tested some solutions but I'm not really satisfied ATM...)

Thanks a lot, community :)
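One way to phrase this in Spark SQL, sketched under assumptions: the requested pairs live in a small DataFrame of (instant, offsets) that can be broadcast, the events sit in a DataFrame with columns (id, timestamp, value), and "last" is taken to mean the value at the latest timestamp inside the window. None of these names come from the thread; this is a sketch of a join-then-aggregate approach, not an endorsed solution:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("feature-gen-sketch").getOrCreate()
import spark.implicits._

// Raw events: one row per (id, timestamp, value).
val events = Seq(
  ("A", 100L, 100.0), ("A", 1000500L, 66.0),
  ("B", 100L, 100.0), ("B", 110L, 50.0),
  ("B", 120L, 200.0), ("B", 250L, 500.0)
).toDF("id", "timestamp", "value")

// Requested (instant, window) pairs, expressed as offsets around the instant.
val requests = Seq(
  (1000500L, -1000L, 0L),  // [t-1000ms, t]
  (133L, -5000L, -2500L)   // [t-5000ms, t-2500ms]
).toDF("instant", "lower_offset", "upper_offset")

// Broadcast the (small) request list, keep only the events falling inside
// each requested window, then aggregate once per (id, instant).
val features = events
  .join(broadcast(requests),
    $"timestamp" >= $"instant" + $"lower_offset" &&
    $"timestamp" <= $"instant" + $"upper_offset")
  .groupBy($"id", $"instant")
  .agg(
    min($"value").as("min_value"),
    max($"value").as("max_value"),
    avg($"value").as("avg_value"),
    // "last" as the value carried by the max (timestamp, value) struct.
    max(struct($"timestamp", $"value")).getField("value").as("last_value")
  )

features.show()

Only ids that have at least one event inside a window appear in the output; an extra join against the full (id, instant) grid would be needed if empty windows must also be reported.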
Re: Feature Generation for Large datasets composed of many time series
Ok, thanks! That's exactly the kind of thing I was imagining with Apache Beam. I still have a few questions:
- Regarding performance, will this be efficient, even with large "windows" / many ids / values / timestamps...?
- My goal after all this is to store the result in Cassandra and/or use the final dataset with Apache SPARK. Will it be easy to do this?
Thanks again Lukasz!

On 2017-07-23 20:42, Lukasz Cwik wrote:
You can do this efficiently with Apache Beam, but you would need to write code which converts a user's expression into a set of PTransforms, or create a few pipeline variants for commonly computed outcomes. There are already many transforms which can compute things like min, max, average; take a look at the javadoc [1]. It seems like you would want to structure your pipeline like:
ReadFromFile -> FilterRecordsBasedUponTimestamp -> Min.perKey/Max.perKey/Average.perKey/... -> OutputToFile
It doesn't seem like windowing/triggers would provide you much value based upon what you describe. Also, it sounds like you would be interested in the SQL development that is ongoing, which would allow users to write these kinds of queries without needing to write a complicated pipeline. The feature branch [2] is looking to be merged into master soon and become part of the next release.
1: https://beam.apache.org/documentation/sdks/javadoc/2.0.0/index.html?org/apache/beam/sdk/transforms/Min.html
2: https://github.com/apache/beam/tree/DSL_SQL

On Wed, Jul 19, 2017 at 4:31 AM, wrote:
Hello, I want to create a lib which generates features for potentially very large datasets. Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (int or string)
I want my tool to compute aggregate functions for many pairs 'instant + duration'.
FOR EXAMPLE: compute, for the instant 't = 2001-01-01', aggregate functions over the data between 't-1month and t' and between 't-12months and t-9months', and this FOR EACH ID! (aggregate functions such as min/max/count/distinct/last/mode or user defined)
My constraints:
- I don't want to compute aggregates for each tuple of 'F': I want to provide a list of pairs 'instant + duration' (potentially large)
- My 'window' defined by the duration may be really large (but may contain only a few values...)
- I may have many ids...
- I may have many timestamps...
Let me describe this with some kind of example to see if Apache Beam may help me to do that. Let's imagine that I have all my data in a DB or a file with the following columns:
id | timestamp(ms) | value
A  | 100           | 100
A  | 1000500       | 66
B  | 100           | 100
B  | 110           | 50
B  | 120           | 200
B  | 250           | 500
(The timestamp is a long value, so as to be able to express dates in ms from -01-01 to today.)
I want to compute operations such as min, max, average, last on the value column, for these pairs:
-> instant = 1000500 / [-1000ms, 0] (i.e. aggregate data between [t-1000ms and t])
-> instant = 133 / [-5000ms, -2500] (i.e. aggregate data between [t-5000ms and t-2500ms])
And this will produce this kind of output:
id | timestamp(ms) | min_value | max_value | avg_value | last_value
A  | 1000500       | min...    | max       | avg       | last
B  | 1000500       | min...    | max       | avg       | last
A  | 133           | min...    | max       | avg       | last
B  | 133           | min...    | max       | avg       | last
Do you think we can do this efficiently with Apache Beam, and do you have an idea on "how"?
Thanks a lot
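On the second question (getting the generated features into Cassandra from Spark), a minimal sketch using the spark-cassandra-connector's DataFrame source. The keyspace/table names, the input path and the connection host are placeholders for illustration, not details from the thread:

// Requires the spark-cassandra-connector package on the classpath and a
// reachable Cassandra cluster.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("features-to-cassandra")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// The (id, timestamp, min_value, ...) dataset produced upstream, assumed
// to have been written out as Parquet by the feature-generation step.
val features = spark.read.parquet("/path/to/generated/features")

features.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "feature_store", "table" -> "features"))
  .mode("append")
  .save()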
Union large number of DataFrames
Hi there! Let's imagine I have a large number of very small DataFrames with the same schema (a list of DataFrames: allDFs) and I want to create one large dataset out of them. I have been trying this:
-> allDFs.reduce( (a,b) => a.union(b) )
and after that this one, to prevent a DataFrame with a huge number of partitions:
-> allDFs.reduce( (a,b) => a.union(b).repartition(200) )
Two questions:
1) Will the reduce operation be done in parallel in the previous code, or maybe should I replace my reduce by allDFs.par.reduce?
2) Is there a better way to concatenate them?
Thanks!
Julio
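A hedged note and sketch on both questions: union is lazy, so the reduce only builds the logical plan on the driver and no Spark jobs run until an action; `.par` would at best parallelize that driver-side plan construction. A left fold also produces a deeply nested, lopsided plan that can be slow to analyze. One common alternative is to union pairwise so the plan depth grows logarithmically; the helper name and the final coalesce are my own choices, not from the thread:

import org.apache.spark.sql.DataFrame

// Union a sequence of same-schema DataFrames pairwise, so the resulting
// logical plan has depth O(log n) instead of O(n).
def unionAll(dfs: Seq[DataFrame]): DataFrame = {
  require(dfs.nonEmpty, "need at least one DataFrame")
  var level = dfs
  while (level.size > 1) {
    level = level.grouped(2).map {
      case Seq(a, b) => a.union(b)
      case Seq(a)    => a
    }.toSeq
  }
  level.head
}

// Usage: build one DataFrame, then reduce the partition count once at the
// end instead of repartitioning (and shuffling) at every step.
// val big = unionAll(allDFs).coalesce(200)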
Feature Generation for Large datasets composed of many time series
Hello,
I want to create a lib which generates features for potentially very large datasets. Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (int or string)
I want my tool to compute aggregate functions for many pairs 'instant + duration'.
FOR EXAMPLE: compute, for the instant 't = 2001-01-01', aggregate functions over the data between 't-1month and t' and between 't-12months and t-9months', and this FOR EACH ID! (aggregate functions such as min/max/count/distinct/last/mode or user defined)
My constraints:
- I don't want to compute aggregates for each tuple of 'F': I want to provide a list of pairs 'instant + duration' (potentially large)
- My 'window' defined by the duration may be really large (but may contain only a few values...)
- I may have many ids...
- I may have many timestamps...
Let me describe this with some kind of example to see if Apache Beam may help me to do that. Let's imagine that I have all my data in a DB or a file with the following columns:
id | timestamp(ms) | value
A  | 100           | 100
A  | 1000500       | 66
B  | 100           | 100
B  | 110           | 50
B  | 120           | 200
B  | 250           | 500
(The timestamp is a long value, so as to be able to express dates in ms from -01-01 to today.)
I want to compute operations such as min, max, average, last on the value column, for these pairs:
-> instant = 1000500 / [-1000ms, 0] (i.e. aggregate data between [t-1000ms and t])
-> instant = 133 / [-5000ms, -2500] (i.e. aggregate data between [t-5000ms and t-2500ms])
And this will produce this kind of output:
id | timestamp(ms) | min_value | max_value | avg_value | last_value
A  | 1000500       | min...    | max       | avg       | last
B  | 1000500       | min...    | max       | avg       | last
A  | 133           | min...    | max       | avg       | last
B  | 133           | min...    | max       | avg       | last
Do you think we can do this efficiently with Apache Beam, and do you have an idea on "how"?
Thanks a lot
Feature Generation for Large datasets composed of many time series
Hello,
I want to create a lib which generates features for potentially very large datasets. Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (int or string)
I want my tool to compute aggregate functions for many pairs 'instant + duration'.
FOR EXAMPLE: compute, for the instant 't = 2001-01-01', aggregate functions over the data between 't-1month and t' and between 't-12months and t-9months', and this FOR EACH ID! (aggregate functions such as min/max/count/distinct/last/mode or user defined)
My constraints:
- I don't want to compute aggregates for each tuple of 'F': I want to provide a list of pairs 'instant + duration' (potentially large)
- My 'window' defined by the duration may be really large (but may contain only a few values...)
- I may have many ids...
- I may have many timestamps...
Let me describe this with some kind of example to see if SPARK (SPARK STREAMING?) may help me to do that. Let's imagine that I have all my data in a DB or a file with the following columns:
id | timestamp(ms) | value
A  | 100           | 100
A  | 1000500       | 66
B  | 100           | 100
B  | 110           | 50
B  | 120           | 200
B  | 250           | 500
(The timestamp is a long value, so as to be able to express dates in ms from -01-01 to today.)
I want to compute operations such as min, max, average, last on the value column, for these pairs:
-> instant = 1000500 / [-1000ms, 0] (i.e. aggregate data between [t-1000ms and t])
-> instant = 133 / [-5000ms, -2500] (i.e. aggregate data between [t-5000ms and t-2500ms])
And this will produce this kind of output:
id | timestamp(ms) | min_value | max_value | avg_value | last_value
A  | 1000500       | min...    | max       | avg       | last
B  | 1000500       | min...    | max       | avg       | last
A  | 133           | min...    | max       | avg       | last
B  | 133           | min...    | max       | avg       | last
Do you think we can do this efficiently with Spark and/or Spark Streaming, and do you have an idea on "how"?
Thanks a lot!