
Re: Working with big datasets, merging two ordered lists by key

2014-03-14 Thread Frank Behrens

I am still working on the solution (see gist https://gist.github.com/9551489)
and want to share my current thoughts.

The problem is to process a join over two big datasets (from different
sources).
Right now I am quite confident, as I break the problem into smaller parts,
and I am starting to see how easy this is in Clojure.
1) I have to bring both datasets (lists) into a nice form: [id {attributes}]
might be a good fit.
2) Because they are sorted and the id is unique (right now), my merge-sorted
function can pull the records from the lists, compare them with a key
function (which defaults to identity in the example above) and pair them up.
3) From the resulting list of pairs I can filter the records I am
interested in, and
4) do my processing over them.
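
Steps 1, 3 and 4 could look roughly like the sketch below. The names ->entry, matched? and combine and the :user_id key are made up for illustration, and the pairs are written out by hand in place of the output of step 2.

```clojure
;; Step 1: bring a record map into the [id {attributes}] form.
;; id-key (e.g. :user_id) is an assumed name for the key column.
(defn ->entry [id-key record]
  [(get record id-key) (dissoc record id-key)])

;; Step 3: keep only the pairs that have a record on both sides.
(defn matched? [[left right]]
  (boolean (and left right)))

;; Step 4: example processing, here merging the two attribute maps.
(defn combine [[[id left-attrs] [_ right-attrs]]]
  [id (merge left-attrs right-attrs)])

;; Sample output of step 2 (the pairing), written out by hand.
(def pairs
  [[[:id1 {:age 38}] [:id1 {:name "Frank"}]]
   [[:id2 {:age 27}] nil]
   [[:id3 {:age 18}] [:id3 {:name "Tim"}]]])

(def result
  (->> pairs
       (filter matched?)
       (map combine)))
;; result holds the entries for :id1 and :id3 with merged attributes
```

Because filter and map are lazy, nothing here forces the whole list of pairs at once.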

This approach seems simple and flexible to me, and would be very useful for
different problems we have at our big enterprise.

I am close to putting the parts together, and will then see how this fits
in memory, and whether it solves my current problem.

But in my newbie Clojure dreams, I could imagine getting this done in a lazy
fashion.

Can my ClojureCLR database query, sorted textfile, merging, filtering and
processing all be lazy?
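
For the merging part at least, a lazy version seems possible. The following is only a sketch, not the code from the gist: merge-sorted and key-fn are names assumed here, and both inputs are assumed to be sorted ascending by key with unique keys. Whether the database query and the file reading are lazy too depends on how those sequences are produced, which this sketch does not cover.

```clojure
(defn merge-sorted
  "Lazily pairs up two sequences that are sorted ascending by (key-fn item).
  Unmatched items are paired with nil, so the result is a full outer join."
  ([a b] (merge-sorted identity a b))
  ([key-fn a b]
   (lazy-seq
     (let [a (seq a)
           b (seq b)]
       (cond
         (and (nil? a) (nil? b)) nil
         (nil? a) (cons [nil (first b)] (merge-sorted key-fn a (rest b)))
         (nil? b) (cons [(first a) nil] (merge-sorted key-fn (rest a) b))
         :else
         (let [fa (first a)
               fb (first b)
               c  (compare (key-fn fa) (key-fn fb))]
           (cond
             (zero? c) (cons [fa fb] (merge-sorted key-fn (rest a) (rest b)))
             (neg? c)  (cons [fa nil] (merge-sorted key-fn (rest a) b))
             :else     (cons [nil fb] (merge-sorted key-fn a (rest b))))))))))

;; Works on infinite inputs, because only the requested pairs are computed.
(def first-two (doall (take 2 (merge-sorted (range) (iterate #(+ % 2) 0)))))
```

Each pair is computed only when the consumer asks for it, so a filter and a doseq on top of this never need the whole result in memory, as long as nothing else holds on to the head of the sequence.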


  

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Frank Behrens

Thanks for your suggestions.
A for loop has to do 100.000 * 300.000 compares.
Storing the database table in a 300.000-element hash would be a memory
penalty I want to avoid.
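
As a rough check of those numbers (the var names are made up for illustration): a nested loop compares every pair, while a merge over two sorted lists advances at least one list per comparison, so the sum of both lengths is an upper bound.

```clojure
(def textfile-count 100000)
(def database-count 300000)

;; Nested for loop: every textfile record against every database record.
(def nested-loop-compares (* textfile-count database-count))

;; Sorted merge: each comparison consumes at least one record.
(def merge-compares-upper-bound (+ textfile-count database-count))
```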

I'm quite sure that the essential part of the solution is a function to
iterate through both lists at once,
spitting out pairs of values according to compare:

(merge-sortedlists
  '(1 2 3)
  '(  2   4))
=> ([1 nil] [2 2] [3 nil] [nil 4])

Seems quite doable.
I'll try to implement it now.

Frank



Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Frank Behrens

Hey, just to share: I came up with this code, which seems quite OK to me.
Feels like I already understand something, do I?
Have a nice day, Frank

;; Pairs up two sorted lists; unmatched items are paired with nil.
;; out is a list, so conj prepends and the pairs come out in reverse order.
(loop
  [a   '(1 2 3 4)
   b   '(1 3)
   out ()]
  (cond
    (and (empty? a) (empty? b)) out
    (empty? a) (recur a (rest b) (conj out [nil (first b)]))
    (empty? b) (recur (rest a) b (conj out [(first a) nil]))
    :else (let [fa  (first a)
                fb  (first b)
                cmp (compare fa fb)]
            (cond
              (= 0 cmp) (recur (rest a) (rest b) (conj out [fa fb]))
              (> 0 cmp) (recur (rest a) b        (conj out [fa nil]))
              :else     (recur a        (rest b) (conj out [nil fb]))))))



Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Timothy Washington

Hey Frank,

Right. So I tried this loop / recur, and it runs, giving a result of *([4
nil] [3 3] [2 nil] [1 1])*. But I'm not sure how that's going to help you
(although not discounting the possibility).

You can simultaneously iterate through pairs of lists, to compare values.
However you cannot guarantee that those lists will be *i)* ordered, and
*ii)* the same length. Both those conditions are required for your
algorithm to work. Plus, what you suggest still means that you'll have to
scan through the entire space of both results. So we're not going to avoid
that.

Based on your requirements, I still see my original *for* comprehension as
the most straightforward way to solve the problem. My second suggested
algorithm could also work. But I could be wrong and am always learning too.
So trying different solutions is a good habit to keep.


Hth

Tim Washington
Interruptsoftware.com http://interruptsoftware.com



Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Leif


Re. Tim's points below:

*i)* The seqs have to be ordered, or one of them has to be loaded fully 
into memory; I don't think there's any way around that.

*ii)* Frank's solution does *not* require the seqs to be the same length, 
and it gives you the complete 'diff' of the seqs (aka outer join), which 
could be handy.  The one snag I see is that it is eager, not lazy, so it's 
going to put the answer completely in memory.  So unless you are projecting 
out a small subset of the fields from each record, you will probably end up 
using as much memory as the other solutions.  I wrote a lazy version using 
'iterate', but I'm not sure it doesn't keep both entire seqs in memory, too.

My two cents:

1. If you have enough memory, go with Moritz' suggestion to read the 
smaller seq into a map.  Then you can do a simple for comprehension and 
arrange it so that the second, larger seq will never be completely in 
memory.
2. Another possible solution is to load the textfile into a temp table in 
your database.  Then the solution is one simple SQL query, backed by 
hyper-optimized code designed to deal with this exact problem.
3. You may want to try the naive approach: 400k records sounds like it 
could very well fit into memory, as long as each record doesn't have a huge 
amount of data.
4. A library that has tools to deal with big files: 
https://github.com/kyleburton/clj-etl-utils

--Leif

On Monday, March 10, 2014 11:01:07 PM UTC-4, frye wrote:

 Hey Frank, 

 Right. So I tried this loop / recur, and it runs, giving a result of *([4 
 nil] [3 3] [2 nil] [1 1])*. But I'm not sure how that's going to help you 
 (although not discounting the possibility). 

 You can simultaneously iterate through pairs of lists, to compare values. 
 However you cannot guarantee that those lists will be *i)* ordered, and 
 *ii)* the same length. Both those conditions are required for your 
 algorithm to work. Plus, what you suggest still means that you'll have to 
 scan through the entire space of both results. So we're not going to avoid 
 that. 

 Based on your requirements, I still see my original *for* comprehension 
 as the most straightforward way to solve the problem. My second suggested 
 algorithm could also work. But I could be wrong and am always learning too. 
 So trying different solutions is a good habit to keep. 



Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Frank Behrens

Hi,

I'm investigating whether Clojure can be used to solve the challenges and
problems we have at my day job better than Ruby or PowerShell. A very
common use case is validating data from different systems against some
criteria. I believe Clojure can be our silver bullet, but before that, it
seems I need to wrap my head around it.

So I am starting at the first level with the challenge of validating some
data from the user database against our Active Directory.

I already have all the parts to make it work: make a hash by user_id from
the database table, export a textfile from AD with each line representing
a user, parse it, merge the information from the user_table_hash, and
voila.

I have not finished implementing this, so I don't know if this naive
approach will work with 400.000 records in the user database and 100.000
in the textfile.
But I am already thinking about how I could implement this in a more
memory-efficient way.

So my simple question:

I have user_textfile (100.000 records), which can be parsed into an
unordered list of user-maps.
I have user_table in the database (400.000 records), which I can query with
ordering and which gives me an ordered list of user-maps.

So I would first order the user_textfile and then conj the user_table
ordered list into it, while doing the database query.
Is that approach right? How would I then merge the two ordered lists like
in the example below?

(def user_textfile
  [[:id1 {:name "Frank"}]
   [:id3 {:name "Tim"}]])

(def user_database
  [[:id1 {:age 38}]
   [:id2 {:age 27}]
   [:id3 {:age 18}]
   [:id4 {:age 60}]])

(merge-sorted-lists user_database user_textfile)
=>
  ([:id1 {:name "Frank" :age 38}]
   [:id3 {:name "Tim"   :age 18}])

Any feedback is appreciated.
Have a nice day,
Frank


Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Timothy Washington

Hey Frank,

Try opening up a REPL and running this *for* comprehension.

(def user_textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def user_database [[:id1 {:age 38}] [:id2 {:age 27}] [:id3 {:age 18}]
                    [:id4 {:age 60}]])

(for [i user_textfile
      j user_database
      :when (= (first i) (first j))]
  {(first i) (merge (second i) (second j))})

({:id1 {:age 38, :name "Frank"}} {:id3 {:age 18, :name "Tim"}})  ;; result from REPL



Hth

Tim Washington
Interruptsoftware.com http://interruptsoftware.com



Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Moritz Ulrich

I think it would be more efficient to read one of the inputs into a
map for random access instead of iterating it every time.
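
A minimal sketch of that idea, assuming the smaller input (the textfile) fits in memory. by-id and joined are names made up for illustration; the data is the sample from the thread.

```clojure
(def user_textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def user_database [[:id1 {:age 38}] [:id2 {:age 27}]
                    [:id3 {:age 18}] [:id4 {:age 60}]])

;; Build the lookup map once from the smaller input.
(def by-id (into {} user_textfile))

;; Walk the larger input a single time and look each id up in the map.
(def joined
  (for [[id attrs] user_database
        :let [other (get by-id id)]
        :when other]
    [id (merge other attrs)]))
```

This replaces the inner loop with one map lookup per database record, and it does not need either input to be sorted.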


Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Timothy Washington

Hmm, the *for* comprehension yields a lazy sequence of results. So the
penalty should only occur when one starts to use / evaluate the result.
Using maps is a good idea. But I think you'll have to use another algorithm
(not *for*) to get the random access you seek.

Frank could try a *clojure.set/intersection* to find common keys between
the lists, then *order* and *map* / *merge* the 2 lists.

Beyond that, I can't see a scenario where some iteration won't have to
search the space for matching keys (which I think
*clojure.set/intersection* does).
A fair point all the same.
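
A rough sketch of the intersection idea, using the sample data from the thread. The names text-by-id, db-by-id, common-ids and merged are made up for illustration, and the sketch assumes both inputs fit in memory as maps.

```clojure
(require '[clojure.set :as set])

(def user_textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def user_database [[:id1 {:age 38}] [:id2 {:age 27}]
                    [:id3 {:age 18}] [:id4 {:age 60}]])

(def text-by-id (into {} user_textfile))
(def db-by-id   (into {} user_database))

;; Keys present in both inputs.
(def common-ids
  (set/intersection (set (keys text-by-id)) (set (keys db-by-id))))

;; Order the common keys, then merge the attributes for each of them.
(def merged
  (map (fn [id] [id (merge (text-by-id id) (db-by-id id))])
       (sort common-ids)))
```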


Tim Washington
Interruptsoftware.com http://interruptsoftware.com


On Sun, Mar 9, 2014 at 12:13 PM, Moritz Ulrich mor...@tarn-vedra.de wrote:

 I think it would be more efficient to read one of the inputs into a
 map for random access instead of iterating it every time.

 On Sun, Mar 9, 2014 at 4:48 PM, Timothy Washington twash...@gmail.com
 wrote:
  Hey Frank,
 
  Try opening up a repl, and running this for comprehension.
 
  (def user_textfile [[:id1 {:name 'Frank'}] [:id3 {:name 'Tim'}]])
  (def user_database [[:id1 {:age 38}] [:id2 {:age 27}] [:id3 {:age 18}]
 [:id4
  {:age 60}]])
 
  (for [i user_textfile
  j user_database
  :when (= (first i) (first j))]
  {(first i) (merge (second i) (second j))})
 
  ({:id1 {:age 38, :name Frank'}} {:id3 {:age 18, :name Tim'}})  ;; result
  from repl
 
 
 
  Hth
 
  Tim Washington
  Interruptsoftware.com
 
 
  On Sun, Mar 9, 2014 at 5:33 AM, Frank Behrens fbehr...@gmail.com wrote:
 
  Hi,
 
  I'm investigating whether Clojure can solve the challenges and problems we
  have at my day job better than Ruby or PowerShell. A very common use case
  is validating data from different systems against some criteria. I believe
  Clojure can be our silver bullet, but before that, it seems I need to wrap
  my head around it.
 
  So I am starting at the first level with the challenge of validating some
  data from the user database against our Active Directory.
 
  I already have all the parts to make it work: build a hash keyed by
  user_id from the database table, export a text file from AD with each line
  representing a user, parse it, merge in the information from the
  user_table_hash, and voilà.
 
  I have not finished implementing this, so I don't know whether this naive
  approach will work with 400,000 records in the user database and 100,000
  in the text file. But I am already thinking about how I could implement
  this in a more memory-efficient way.
 
  So my simple question:
 
  I have user_textfile (100,000 records), which can be parsed into an
  unordered list of user-maps.
  I have user_table in the database (400,000 records), which I can query
  with ordering, giving me an ordered list of user-maps.
 
  So I would first order the user_textfile and then conj the user_table
  ordered list into it while doing the database query.
  Is that approach right? How would I then merge the two ordered lists, as
  in the example below?
 
  (def user_textfile
    [[:id1 {:name "Frank"}]
     [:id3 {:name "Tim"}]])
 
  (def user_database
    [[:id1 {:age 38}]
     [:id2 {:age 27}]
     [:id3 {:age 18}]
     [:id4 {:age 60}]])
 
  (merge-sorted-lists user_database user_textfile)
  =>
  ([:id1 {:name "Frank" :age 38}]
   [:id3 {:name "Tim"   :age 18}])
 
  Any feedback is appreciated.
  Have a nice day,
  Frank
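
One possible shape for the `merge-sorted-lists` function asked about above, as a sketch only. It assumes both inputs are already sorted by id, that ids are unique within each list, and that ids can be ordered with `compare` (keywords can). It walks both lists in step and produces a lazy sequence, so neither input has to be fully realized up front.

```clojure
;; Sample data in the [id attributes] shape from this thread.
(def user-textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def user-database [[:id1 {:age 38}] [:id2 {:age 27}]
                    [:id3 {:age 18}] [:id4 {:age 60}]])

(defn merge-sorted-lists
  "Lazily pairs up records with equal ids from two id-sorted lists.
  Records whose id appears in only one list are dropped."
  [xs ys]
  (lazy-seq
    (when (and (seq xs) (seq ys))
      (let [[x-id x-attrs] (first xs)
            [y-id y-attrs] (first ys)
            c (compare x-id y-id)]
        (cond
          (neg? c) (merge-sorted-lists (rest xs) ys)   ; x-id is behind, advance xs
          (pos? c) (merge-sorted-lists xs (rest ys))   ; y-id is behind, advance ys
          :else    (cons [x-id (merge x-attrs y-attrs)]
                         (merge-sorted-lists (rest xs) (rest ys))))))))

(merge-sorted-lists user-database user-textfile)
```

Each record is visited at most once, so the work grows with the combined length of the two lists rather than with their product.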
 


