RE: [julia-users] Filtering DataFrame with a function

2016-10-13 Thread Júlio Hoffimann
That is really cool, David. I fully agree with this modularization :)

-Júlio


RE: [julia-users] Filtering DataFrame with a function

2016-10-13 Thread David Anthoff
Yes, I think the general plan of the DataFrames maintainers is to rely largely
on packages like Query and StructuredQueries for data manipulation.

There is another benefit of having this kind of query infrastructure in its own
package: all the query operations I showed that use Query also work against
any other data source, e.g. you can use them with arrays, dictionaries,
directly with a CSV source (where everything is streamed, with no allocation
of intermediates!), TypedTables, IndexedTables and many more.
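
To make the "one query, many sources" idea concrete, here is a minimal sketch in plain Julia. It does not use Query.jl at all: `Iterators.filter` from Base stands in for a streamed query, and the data, the `expensive` predicate and all variable names are assumptions made up for illustration (named-tuple literals need Julia 0.7 or later).

```julia
# Hypothetical predicate and data; none of these names come from Query.jl.
expensive(price) = price > 30.0

# Source 1: an array of named tuples (one per row).
rows = [(name = "a", price = 12.5), (name = "b", price = 45.0), (name = "c", price = 31.0)]

# Source 2: a dictionary mapping a key to a price.
prices = Dict("a" => 12.5, "b" => 45.0, "c" => 31.0)

# Source 3: a generator standing in for a streamed source.
stream = (p for p in (12.5, 45.0, 31.0))

# Iterators.filter is lazy: it wraps the source and yields matches one at a time.
from_rows = Iterators.filter(r -> expensive(r.price), rows)
from_dict = Iterators.filter(kv -> expensive(kv.second), prices)
from_stream = Iterators.filter(expensive, stream)

# Results are only materialized here.
names_rows = [r.name for r in from_rows]
names_dict = sort([kv.first for kv in from_dict])
n_stream = count(_ -> true, from_stream)
```

The same predicate is reused across all three sources, and no intermediate collection is built until the final three lines.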

 

Cheers,

David

 


Re: [julia-users] Filtering DataFrame with a function

2016-10-13 Thread Júlio Hoffimann
Hi Alex,

That is closer to what I had in mind originally, but I actually solved the
problem by reorganizing my algorithm to avoid filters.

Thank you,
-Júlio


Re: [julia-users] Filtering DataFrame with a function

2016-10-13 Thread Alex Mellnik
Hi Júlio,

If you're just interested in using an arbitrary function to filter on rows 
you can do something like:

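using DataFrames
df = DataFrame(Fish = ["Amir", "Betty", "Clyde"], Mass = [1.2, 3.3, 0.4])
# row[:Fish][1] is a Char, so compare it with 'A'; a Char never equals the String "A"
keeprow(row) = (row[:Fish][1] != 'A') && (row[:Mass] > 1)
df = df[[keeprow(r) for r in eachrow(df)], :]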

Is that what you're looking for?  If not, can you give an example of what 
you want to do?

Best,

Alex


Re: [julia-users] Filtering DataFrame with a function

2016-10-12 Thread Jacob Quinn
I think the Julia ecosystem is evolving tremendously in this respect. I
think originally, there were a lot of these "mammoth" packages that tried
to provide everything and the kitchen sink. Unfortunately, this has led to
package bloat, package inefficiencies in terms of load times and
installation, and unmaintainability. DataFrames and Gadfly are great
examples.

The trend more recently has been a rededication to small, modular packages
that interoperate nicely with others. This means moving things that aren't
totally essential **out** of packages; in the case of DataFrames, that can
include things like IO (CSV.jl), data manipulation (Query.jl and
StructuredQueries.jl), and others.

Ultimately, with the help of core language features like
https://github.com/JuliaLang/julia/issues/15705, I think we'll continue to
see packages slim down. This, of course, opens up more possibilities in the
future for so-called "meta" packages that could bundle several packages
together. These "meta" packages are then essentially tasked with tracking
versions, dependencies, and so forth while individual packages can focus on
simple, solid code.

-Jacob



Re: [julia-users] Filtering DataFrame with a function

2016-10-12 Thread Júlio Hoffimann
Thank you very much, David. These queries you showed are really nice. I
meant that ideally I wouldn't need to install another package for a simple
filter operation on the rows.

-Júlio



Re: [julia-users] Filtering DataFrame with a function

2016-10-12 Thread anthoff
Were you worried about Query being not lightweight enough in terms of 
overhead, or in terms of syntax?

I just added a more lightweight syntax for this scenario to Query. You can 
now do the following two things:

q = @where(df, i->i.price > 30.)

that will return a filtered iterator. You can materialize that into a 
DataFrame with collect(q, DataFrame).

I also added a counting option. Turns out that is actually a LINQ query 
operator, and the goal is to implement all of those in Query. The syntax is 
simple:

@count(df, i->i.price > 30.)

returns the number of rows for which the filter condition is true.
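
For comparison, counting matching rows without materializing them can also be sketched in plain Julia. This is not the @count macro from Query: it uses `count(pred, itr)` from Base, and the `rows` vector of named tuples is a made-up stand-in for a DataFrame.

```julia
# Hypothetical rows; a plain vector of named tuples stands in for a DataFrame.
rows = [(item = "pen", price = 2.5), (item = "lamp", price = 42.0), (item = "desk", price = 180.0)]

# count(pred, itr) walks the rows once and allocates no intermediate collection.
n_expensive = count(r -> r.price > 30.0, rows)

# Equivalent result, but this builds a temporary filtered array first.
n_via_filter = length(filter(r -> r.price > 30.0, rows))
```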

Under the hood, both of these new syntax options use the normal Query
machinery; they just provide a simpler syntax than the more elaborate
queries I posted earlier. In terms of LINQ, this corresponds to the method
invocation API that LINQ has. I'm still figuring out how to surface something
like @count in the query expression syntax, but for now one can use it via
this macro.

All of this is on master right now, so you would have to do 
Pkg.checkout("Query") to get these macros.

Best,
David


Re: [julia-users] Filtering DataFrame with a function

2016-10-12 Thread Júlio Hoffimann
Hi David,

Thank you for your detailed answer and for writing a package for general
queries, that is great! I will keep the package in mind if I need something
more complex.

I am currently looking for a lightweight solution within DataFrames, since
filtering is a very common operation. Right now, I am considering
converting the DataFrame to an array and looping over the rows. I wonder if
there is syntactic sugar for this loop.
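
A rough sketch of what the "convert to an array and loop over the rows" idea could look like, with two shorter spellings of the same loop. The matrix `A` and the `keeprow` predicate are hypothetical, and `findall` assumes Julia 0.7 or later.

```julia
# Hypothetical data: each row is one observation, columns are numeric features.
A = [1.0 10.0;
     2.0 40.0;
     3.0 35.0]

# Any function of a row that returns a Bool can serve as the filter.
keeprow(row) = row[2] > 30.0

# Explicit loop: collect the indices of the rows that pass.
kept = Int[]
for i in 1:size(A, 1)
    keeprow(view(A, i, :)) && push!(kept, i)
end

# Sugar 1: a comprehension builds the Boolean mask in one line.
mask = [keeprow(view(A, i, :)) for i in 1:size(A, 1)]

# Sugar 2: findall over the row indices gives the same indices as the loop.
idx = findall(i -> keeprow(view(A, i, :)), 1:size(A, 1))

filtered = A[mask, :]
```

Using `view` avoids copying each row while the predicate inspects it.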

-Júlio



RE: [julia-users] Filtering DataFrame with a function

2016-10-12 Thread David Anthoff
Hi Julio,

 

You can use the Query package for the first part. To filter a DataFrame using
an arbitrary Julia expression, use something like this:

using DataFrames, Query, NamedTuples

q = @from i in df begin
    @where <condition>
    @select i
end

 

You can use any Julia code as the condition after @where. Say your DataFrame
has a column called price, then you could filter like this:

@where i.price > 30.

The i will be a NamedTuple, so you can access the columns either by name or
by index, e.g.

@where i[1] > 30.

if you want to filter by the first column. You can also just call some function
that you have defined somewhere else:

@where foo(i)

As long as the condition returns a Bool, you should be good.
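
The two access styles and the "any function that returns a Bool" rule can be tried outside of a query. The sketch below assumes Julia 0.7 or later, where named tuples are built into the language rather than provided by the NamedTuples package; the field names and the body of `foo` are invented for illustration.

```julia
# Hypothetical row; the field names are made up for illustration.
row = (price = 45.0, qty = 3)

by_name = row.price      # access by field name
by_index = row[1]        # access by position (first column)

# A predicate defined elsewhere; it only has to return a Bool.
foo(r) = r.price > 30.0 && r.qty >= 1

rows = [row, (price = 12.0, qty = 7), (price = 99.0, qty = 0)]
kept = filter(foo, rows)
```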

 

If you run a query like this, q will be a standard Julia iterator. Right now
you can’t just say length(q), although that is something I should probably
enable at some point (I’m also looking into the VB LINQ syntax, which supports
things like counting in the query expression itself).

But you could materialize the query as an array and then look at its length:

q = @from i in df begin
    @where <condition>
    @select i
    @collect
end

n = length(q)

The @collect statement means that the query will return an array of
NamedTuples (you can also materialize it into a whole bunch of other data
structures; take a look at the documentation).
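
The same distinction between a lazy iterator and a materialized array exists in Base Julia, which may help to see why length is not available on the former. This is only an analogy: `Iterators.filter` is not what Query uses internally as far as this thread says, and `rows` is hypothetical data.

```julia
# Hypothetical rows with a single column.
rows = [(price = 12.0,), (price = 45.0,), (price = 31.0,)]

# A lazy filtered iterator: its length is unknown until it has been consumed.
q = Iterators.filter(r -> r.price > 30.0, rows)
size_trait = Base.IteratorSize(q)   # reports that the size is unknown

# Materializing into an array makes length available.
matches = collect(q)
n = length(matches)
```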

 

Let me know if this works, or if you have any other feedback on Query.jl. I
would really appreciate user feedback for the package at this point. The best
way to give it is to open issues at https://github.com/davidanthoff/Query.jl.

 

Best,

David

 

From: julia-users@googlegroups.com [mailto:julia-users@googlegroups.com] On 
Behalf Of Júlio Hoffimann
Sent: Wednesday, October 12, 2016 5:20 PM
To: julia-users 
Subject: [julia-users] Filtering DataFrame with a function

 

Hi,

 

I have a DataFrame in which I want to filter rows that match given criteria.
I don't know the number of columns beforehand, so I cannot explicitly list the
criteria with the :symbol syntax or write down a fixed number of indices.

 

Is there any way to filter with a lambda expression? Or even better, is there 
any efficient way to count the number of occurrences of a specific row of 
observations?
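
One possible reading of the counting question, sketched in plain Julia with no DataFrames involved: treat each row as a tuple, so the number of columns never has to be written down, and tally rows in a dictionary. The `rows` data and the `target` row are assumptions for illustration only.

```julia
# Hypothetical observations; the code never refers to the number of columns.
rows = [(1, "a", 0.5), (2, "b", 1.5), (1, "a", 0.5), (1, "a", 0.5)]

# Tally every distinct row in one pass; tuples hash by value, so they work as keys.
tally = Dict{eltype(rows), Int}()
for r in rows
    tally[r] = get(tally, r, 0) + 1
end

# Count a single target row without building the full tally.
target = (1, "a", 0.5)
n_target = count(r -> r == target, rows)
```

The dictionary is the better fit when many different rows are looked up; the single `count` call is enough for one row.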

 

-Júlio