Re: [julia-users] Re: Crashing while parsing large XML file

2016-01-30 Thread Brandon Booth
I'm a moron, but that's a different issue. I fixed the readline/eachline 
issue, but that didn't address the crashing problem. I did some 
experimenting, though, and I think I fixed the problem. 

I added free(str) at the end of each loop to free up the memory from 
parse_string. I parsed each line, but for some reason my program was hanging 
onto the results, so the memory usage was slowly creeping up until the 
program crashed. Adding free(str) kept the memory usage flat and ran 
through the entire file.
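For reference, a minimal sketch of the fixed inner loop. It assumes the parse_string/root/content/free calls come from LightXML.jl (whose API matches the code quoted below), and it omits the thread's DataFrame/SQLite helpers (convertdf, sink, df_types, df_names):

```julia
using LightXML  # assumed package; its API matches the parse_string/root/content calls in the thread

f = open("contracts.xml", "r")
for l in eachline(f)
    startswith(l, "<") || continue   # skip non-element lines
    str = parse_string(l)            # allocates a libxml2 document on the C heap
    r = root(str)
    fieldname = symbol(name(r))      # Julia 0.4 spelling; Symbol(...) on later versions
    fieldvalue = string(content(r))
    # ... build the 1x200 row and stream to SQLite as in the original loop ...
    free(str)  # release the C-side memory; Julia's GC does not track libxml2 allocations
end
close(f)
```

Without the explicit free, each parsed document stays allocated on the C side for the life of the process, which is consistent with the slow memory creep described above.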



On Thursday, January 28, 2016 at 3:38:45 PM UTC-5, Stefan Karpinski wrote:
>
> At best, you'll only see every other line, right? At worst, eachline may 
> do some IO lookahead (i.e. read one line ahead) and this will do something 
> even more confusing.
>
> On Thu, Jan 28, 2016 at 3:35 PM, Brandon Booth wrote:
>
>> No real reason. I was going back and forth between eachline(f) and for i 
>> = 1:n to see if it worked for 1000 rows, then 10,000 rows, etc. I ended up 
>> with a hybrid of the two. Will that matter much?
>>
>>
>> On Thursday, January 28, 2016 at 1:32:09 PM UTC-5, Diego Javier Zea wrote:
>>>
>>> Hi! 
>>>
>>> Why are you using
>>>
>>> for line in eachline(f)
>>>   l = readline(f)
>>>
>>>
>>> instead of
>>>
>>> for l in eachline(f)
>>>
>>>
>>> ?
>>>
>>> Best
>>>
 On Thursday, January 28, 2016 at 12:42:35 PM UTC-3, Brandon Booth wrote:

 I'm parsing an XML file that's about 30 GB and wrote the loop below to 
 parse it line by line. My code cycles through each line and builds a 1x200 
 dataframe that is appended to a larger dataframe. When the larger dataframe 
 gets to 1000 rows I stream it to an SQLite table. The code works for the 
 first 25 million or so lines (which equates to 125,000 or so records in the 
 SQLite table) and then freezes. I've tried it without the larger dataframe 
 but that didn't help.

 Any suggestions to avoid crashing?

 Thanks.

 Brandon



 The XML structure:
 <record>
 <field1>value</field1>
 <field2>value</field2>
 ...
 </record>
 <record>
 <field1>value</field1>
 <field2>value</field2>
 ...
 </record>

 My loop:

 f = open("contracts.xml","r")
 readline(f)
 n = countlines(f)
 tic()
 for line in eachline(f)
   l = readline(f)
   if startswith(l,"<record>")
     df = DataFrame(df_types,df_names,1)
   elseif startswith(l,"</record>")
     append!(df1,df)
     if size(df1,1) == 1000
       source = convertdf(df1)
       Data.stream!(source,sink)
       deleterows!(df1,1:1000)
     end
   else
     str = parse_string(l)
     r = root(str)
     df[symbol(name(r))] = string(content(r))
   end
 end

 close(f)


>
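Stefan's every-other-line point is easy to reproduce with plain Base Julia (the four-line file here is illustrative):

```julia
# Demonstrate why mixing `for line in eachline(f)` with an inner
# `readline(f)` skips half the file: both draw from the same IO stream.
path, io = mktemp()
write(io, "line1\nline2\nline3\nline4\n")
close(io)

seen_buggy = String[]
open(path) do f
    for line in eachline(f)   # the iterator consumes line1, line3, ...
        l = readline(f)       # ... so readline sees only line2, line4, ...
        push!(seen_buggy, l)
    end
end

seen_ok = String[]
open(path) do f
    for l in eachline(f)      # correct: iterate directly over eachline
        push!(seen_ok, l)
    end
end

seen_buggy  # → ["line2", "line4"]
seen_ok     # → ["line1", "line2", "line3", "line4"]
```

On a 30 GB file the buggy form silently drops every odd-numbered line, so the parser never even sees half the records.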
