Re: [Large File Processing] What am I doing wrong?

2014-01-27 Thread Curtis Gagliardi
If ordering isn't important, I'd just dump them all into a set instead of 
manually checking whether or not you already put the url into a set. 
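
Something like this, roughly (a sketch only; the tab-separated layout and 
the md5 fn are assumptions borrowed from the other replies):

(->> (next (line-seq rdr))
     (map (fn [line]
            (let [[_ _ _ url] (str/split line #"\t")]
              [url (md5 url)])))
     (into #{}))   ; the set drops duplicate [url hash] pairs for you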


On Sunday, January 26, 2014 10:46:46 PM UTC-8, danneu wrote:
>
> I use line-seq, split, and destructuring to parse large CSVs.
>
> Here's how I'd approach what I think you're trying to do:
>
> ;; assumes (:require [clojure.java.io :as io]
> ;;                   [clojure.string :as str])
> ;; and an md5 fn from a digest library aliased as `m`
> (with-open [rdr (io/reader (io/resource csv) :encoding "UTF-16")]
>   (let [extract-url-hash (fn [line]
>                            (let [[_ _ _ url & _] (str/split line #"\t")]
>                              [(m/md5 url) url]))]
>     (->> (drop 1 (line-seq rdr))
>          (map extract-url-hash)
>          (into {}))))
>
> https://gist.github.com/danneu/8644022
>
> On Tuesday, January 21, 2014 12:55:00 AM UTC-6, Jarrod Swart wrote:
>>
>> I'm processing a large csv with Clojure, honestly not even that big (~18k 
>> rows, 11mb).  I have a list of exported data from a client and I am 
>> de-duplicating URLs within the list.  My final output is a series of 
>> vectors: [url url-hash].
>>
>> The odd thing is how slow it seems to be going.  I have tried 
>> implementing this as a reduce, and finally I thought to speed things up I 
>> might try a "with-open and a loop-recur".  It doesn't seem to have done 
>> much in my case.  I know I am doing something wrong I'm just not sure what 
>> yet.  The best I can do is about 4 seconds, which may only seem slow 
>> because I implemented it in python first and it takes a half second to 
>> finish.  Still this is one of the smaller files I will likely deal with so 
>> I'm worried that as the files grow it may get too slow.
>>
>> The code is here on ref-heap for easy viewing: 
>> https://www.refheap.com/26098
>>
>> Any advice is appreciated.
>>
>



Re: [Large File Processing] What am I doing wrong?

2014-01-26 Thread danneu
I use line-seq, split, and destructuring to parse large CSVs.

Here's how I'd approach what I think you're trying to do:

;; assumes (:require [clojure.java.io :as io]
;;                   [clojure.string :as str])
;; and an md5 fn from a digest library aliased as `m`
(with-open [rdr (io/reader (io/resource csv) :encoding "UTF-16")]
  (let [extract-url-hash (fn [line]
                           (let [[_ _ _ url & _] (str/split line #"\t")]
                             [(m/md5 url) url]))]
    (->> (drop 1 (line-seq rdr))
         (map extract-url-hash)
         (into {}))))

https://gist.github.com/danneu/8644022

On Tuesday, January 21, 2014 12:55:00 AM UTC-6, Jarrod Swart wrote:
>
> I'm processing a large csv with Clojure, honestly not even that big (~18k 
> rows, 11mb).  I have a list of exported data from a client and I am 
> de-duplicating URLs within the list.  My final output is a series of 
> vectors: [url url-hash].
>
> The odd thing is how slow it seems to be going.  I have tried implementing 
> this as a reduce, and finally I thought to speed things up I might try a 
> "with-open and a loop-recur".  It doesn't seem to have done much in my 
> case.  I know I am doing something wrong I'm just not sure what yet.  The 
> best I can do is about 4 seconds, which may only seem slow because I 
> implemented it in python first and it takes a half second to finish.  Still 
> this is one of the smaller files I will likely deal with so I'm worried 
> that as the files grow it may get too slow.
>
> The code is here on ref-heap for easy viewing: 
> https://www.refheap.com/26098
>
> Any advice is appreciated.
>



Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Jarrod Swart
Jim,

Thanks for the idioms, I appreciate it!

And thanks everyone for the help!

On Tuesday, January 21, 2014 8:43:40 AM UTC-5, Jim foo.bar wrote:
>
> On 21/01/14 13:11, Chris Perkins wrote: 
> > This part: (some #{hashed} already-seen) is doing a linear lookup in 
> > `already-seen`. Try (contains? already-seen hashed) instead. 
>
> +1 to that as it will become faster... 
>
> I would also add the following not so related to performance: 
>
> (drop 1 (line-seq f))  ==>  (next (line-seq f)) 
>
> (if seen? nil [url hashed])  ==>  (when-not seen? [url hashed]) 
>
> (if seen? nil hashed)  ==>  (when-not seen? hashed) 
>
> (if (seq (rest lines)) ...  ==>  (if (seq lines) ... 
>
>
> I actually think the last one is a bug...it seems to me that you are 
> skipping one row in the condition...you pass (rest lines) every time you 
> recurse, yes? 
> Checking for more lines should be done for *all* current lines, not (rest 
> current-lines)...unless I've misunderstood something... 
>
>
> Jim 
>
>
>



Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Jarrod Swart
Chris,

Thanks, this was in fact it.  I had read that sets have near O(1) lookup, 
but apparently I was not achieving this with (some).  Thank you, the 
execution time is about 25x faster now!

Jarrod

On Tuesday, January 21, 2014 8:11:09 AM UTC-5, Chris Perkins wrote:
>
> On Monday, January 20, 2014 11:55:00 PM UTC-7, Jarrod Swart wrote:
>>
>> I'm processing a large csv with Clojure, honestly not even that big (~18k 
>> rows, 11mb).  I have a list of exported data from a client and I am 
>> de-duplicating URLs within the list.  My final output is a series of 
>> vectors: [url url-hash].
>>
>> The odd thing is how slow it seems to be going.  I have tried 
>> implementing this as a reduce, and finally I thought to speed things up I 
>> might try a "with-open and a loop-recur".  It doesn't seem to have done 
>> much in my case.  I know I am doing something wrong I'm just not sure what 
>> yet.  The best I can do is about 4 seconds, which may only seem slow 
>> because I implemented it in python first and it takes a half second to 
>> finish.  Still this is one of the smaller files I will likely deal with so 
>> I'm worried that as the files grow it may get too slow.
>>
>> The code is here on ref-heap for easy viewing: 
>> https://www.refheap.com/26098
>>
>> Any advice is appreciated.
>>
>
> This part: (some #{hashed} already-seen) is doing a linear lookup in 
> `already-seen`. Try (contains? already-seen hashed) instead.
>
> - Chris
>
>



Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Michael Gardner
On Jan 21, 2014, at 07:11, Chris Perkins wrote:

> This part: (some #{hashed} already-seen) is doing a linear lookup in 
> `already-seen`. Try (contains? already-seen hashed) instead.

Or just (already-seen hashed), given that OP's not trying to store nil hashes.

To OP: note that if you’re storing the hashes as strings (as it appears), 
you’re using 16 more bytes per hash than necessary. If you’re really going to 
be dealing with so many URLs that you’d use too much memory by storing the 
unique URLs directly, then you should probably be storing the hashes as byte 
arrays.
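
For instance, a sketch using java.security.MessageDigest (one caveat I 
should flag: Java arrays compare by identity, so wrap the bytes in a 
vector before putting them in a Clojure set):

(import 'java.security.MessageDigest)

;; 16 raw bytes instead of a 32-char hex string
(defn md5-bytes ^bytes [^String s]
  (.digest (MessageDigest/getInstance "MD5") (.getBytes s "UTF-8")))

;; wrap in a vector so set membership uses value equality
(conj #{} (vec (md5-bytes "http://example.com")))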

Alternatively, if you’re going to be dealing with REALLY large files and are 
running on Linux/BSD, consider dumping just the URLs to a file and using “sort 
-u” on it. UNIX sort can efficiently handle files that are too large to fit in 
memory, via external merge sort.
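
A sketch of driving that from Clojure, if you want to keep it in-process 
(assumes the URLs have already been dumped one per line; both file names 
are made up):

(require '[clojure.java.shell :refer [sh]])

;; external merge sort handles files larger than memory
(sh "sh" "-c" "sort -u urls.txt > unique-urls.txt")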



Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Jim - FooBar();

On 21/01/14 13:11, Chris Perkins wrote:
This part: (some #{hashed} already-seen) is doing a linear lookup in 
`already-seen`. Try (contains? already-seen hashed) instead.


+1 to that as it will become faster...

I would also add the following not so related to performance:

(drop 1 (line-seq f))  ==>  (next (line-seq f))

(if seen? nil [url hashed])  ==>  (when-not seen? [url hashed])

(if seen? nil hashed)  ==>  (when-not seen? hashed)

(if (seq (rest lines)) ...  ==>  (if (seq lines) ...


I actually think the last one is a bug...it seems to me that you are skipping 
one row in the condition...you pass (rest lines) every time you recurse, yes?
Checking for more lines should be done for *all* current lines, not (rest 
current-lines)...unless I've misunderstood something...
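
To make the shape concrete, here's a sketch of the loop I have in mind 
(hypothetical names, written from the description rather than the refheap 
code):

(loop [lines (next (line-seq rdr)), seen #{}, acc []]
  (if (seq lines)   ; test the lines you are about to consume
    (let [[_ _ _ url] (str/split (first lines) #"\t")
          hashed (md5 url)]
      (recur (rest lines)
             (conj seen hashed)
             (if (contains? seen hashed) acc (conj acc [url hashed]))))
    acc))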


Jim




Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Chris Perkins
On Monday, January 20, 2014 11:55:00 PM UTC-7, Jarrod Swart wrote:
>
> I'm processing a large csv with Clojure, honestly not even that big (~18k 
> rows, 11mb).  I have a list of exported data from a client and I am 
> de-duplicating URLs within the list.  My final output is a series of 
> vectors: [url url-hash].
>
> The odd thing is how slow it seems to be going.  I have tried implementing 
> this as a reduce, and finally I thought to speed things up I might try a 
> "with-open and a loop-recur".  It doesn't seem to have done much in my 
> case.  I know I am doing something wrong I'm just not sure what yet.  The 
> best I can do is about 4 seconds, which may only seem slow because I 
> implemented it in python first and it takes a half second to finish.  Still 
> this is one of the smaller files I will likely deal with so I'm worried 
> that as the files grow it may get too slow.
>
> The code is here on ref-heap for easy viewing: 
> https://www.refheap.com/26098
>
> Any advice is appreciated.
>

This part: (some #{hashed} already-seen) is doing a linear lookup in 
`already-seen`. Try (contains? already-seen hashed) instead.
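
Roughly, assuming `already-seen` is a set of hash strings:

(some #{hashed} already-seen)      ; scans every element: O(n) per lookup
(contains? already-seen hashed)    ; hash lookup: effectively O(1)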

- Chris



Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Rudi Engelbrecht
Hi Jarrod

I have had success with the clojure-csv [1] library and processing large files 
in a lazy way (as opposed to using slurp).

[1] - clojure-csv - https://github.com/davidsantiago/clojure-csv

Here is a copy of my source code (disclaimer: this is my first Clojure 
program, so some things might not be idiomatic).

This code handles a 250MB file, 315K rows (each row has 100 columns / fields) 
really well, and can scale in terms of memory usage since it handles the file 
lazily and processes / parses each line one at a time.

See snippets of code below

(ns scripts.core
  (:gen-class))

(require '[clojure.java.io :as io]
         '[clojure-csv.core :as csv]
         '[clojure.string :as str])

(def line-count 0)   ;; crude counter, re-def'd below

(defn parse-row [row]
  (first (csv/parse-csv row :delimiter \tab)))

(defn parse-file [filename]
  (with-open [file (io/reader filename)]
    (doseq [line (line-seq file)]
      (let [record (parse-row line)]
        (println record))   ;; replace println with your own logic
      (def line-count (inc line-count)))))

(defn process-file [filename]
  (do
    (def line-count 0)
    (parse-file filename)
    (println line-count)))

(defn -main [& args]
  (process-file (first args)))
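
For completeness, here's a sketch of parse-file without the mutable def, 
threading the count through reduce instead (untested, same requires as 
above):

(defn parse-file [filename]
  (with-open [file (io/reader filename)]
    (reduce (fn [n line]
              (println (parse-row line))  ;; replace println with your own logic
              (inc n))
            0
            (line-seq file))))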

Feel free to ask questions if you need more info.

Kind regards

Rudi

On 21/01/2014, at 5:55 PM, Jarrod Swart  wrote:

> I'm processing a large csv with Clojure, honestly not even that big (~18k 
> rows, 11mb).  I have a list of exported data from a client and I am 
> de-duplicating URLs within the list.  My final output is a series of vectors: 
> [url url-hash].
> 
> The odd thing is how slow it seems to be going.  I have tried implementing 
> this as a reduce, and finally I thought to speed things up I might try a 
> "with-open and a loop-recur".  It doesn't seem to have done much in my case.  
> I know I am doing something wrong I'm just not sure what yet.  The best I can 
> do is about 4 seconds, which may only seem slow because I implemented it in 
> python first and it takes a half second to finish.  Still this is one of the 
> smaller files I will likely deal with so I'm worried that as the files grow 
> it may get too slow.
> 
> The code is here on ref-heap for easy viewing: https://www.refheap.com/26098
> 
> Any advice is appreciated.
> 



[Large File Processing] What am I doing wrong?

2014-01-20 Thread Jarrod Swart
I'm processing a large csv with Clojure, honestly not even that big (~18k 
rows, 11mb).  I have a list of exported data from a client and I am 
de-duplicating URLs within the list.  My final output is a series of 
vectors: [url url-hash].

The odd thing is how slow it seems to be going.  I have tried implementing 
this as a reduce, and finally I thought to speed things up I might try a 
"with-open and a loop-recur".  It doesn't seem to have done much in my 
case.  I know I am doing something wrong I'm just not sure what yet.  The 
best I can do is about 4 seconds, which may only seem slow because I 
implemented it in python first and it takes a half second to finish.  Still 
this is one of the smaller files I will likely deal with so I'm worried 
that as the files grow it may get too slow.

The code is here on ref-heap for easy viewing: https://www.refheap.com/26098

Any advice is appreciated.
