Re: reading large XML files

2014-05-20 Thread Xiangrui Meng
Try sc.wholeTextFiles(). It reads the entire file into a string
record. -Xiangrui

On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
 We are trying to read some large GraphML files to use in Spark.

 Is there an easy way to read XML-based files like this that accounts for
 partition boundaries and the like?

  Thanks,
  Nathan


 --
 Nathan Kronenfeld
 Senior Visualization Developer
 Oculus Info Inc
 2 Berkeley Street, Suite 600,
 Toronto, Ontario M5A 4J5
 Phone:  +1-416-203-3003 x 238
 Email:  nkronenf...@oculusinfo.com


Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Unfortunately, I don't have a bunch of moderately big XML files; I have
one really big file, big enough that reading it into memory as a single
string is not feasible.








Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Thanks, that sounds perfect.



On Tue, May 20, 2014 at 1:38 PM, Xiangrui Meng men...@gmail.com wrote:

 You can search for XMLInputFormat on Google. There are some
 implementations that allow you to specify the tag to split on, e.g.:

 https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java





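
For readers wondering how a tag-splitting XMLInputFormat can respect
partition boundaries, here is a rough sketch of the general idea in plain
Python. It is not the Cloud9 code and does not use Spark or Hadoop; the
names read_records, read_split_records and chunk_size are made up for
illustration. It assumes records are not nested, every record has an
explicit closing tag, and the tags can be matched as literal byte strings.
Each split emits only the records whose start tag begins inside its byte
range, and reads past the end of the range to finish the last one, so the
file is streamed in chunks and never held in memory as a single string.

```python
import io


def read_records(stream, start_tag, end_tag, split_start, split_end,
                 chunk_size=64 * 1024):
    """Yield each record whose start tag begins in [split_start, split_end).

    stream is a seekable binary file object; the tags are byte strings.
    A record that begins inside the split is read through to its end tag
    even when that lies past split_end, so no two splits share a record.
    """
    stream.seek(split_start)
    buf = b""
    buf_offset = split_start  # file offset of buf[0]
    while True:
        i = buf.find(start_tag)
        while i < 0:
            # Drop scanned bytes, keeping a tail that could hold the
            # beginning of a start tag cut off by the chunk boundary.
            drop = max(len(buf) - (len(start_tag) - 1), 0)
            buf_offset += drop
            buf = buf[drop:]
            if buf_offset >= split_end:
                return  # any later start tag belongs to another split
            chunk = stream.read(chunk_size)
            if not chunk:
                return  # end of file
            buf += chunk
            i = buf.find(start_tag)
        if buf_offset + i >= split_end:
            return  # this record belongs to a later split
        j = buf.find(end_tag, i + len(start_tag))
        while j < 0:
            chunk = stream.read(chunk_size)
            if not chunk:
                return  # unterminated record at end of file
            # Resume where an end tag straddling the old buffer could start.
            search_from = max(len(buf) - len(end_tag) + 1, i + len(start_tag))
            buf += chunk
            j = buf.find(end_tag, search_from)
        record_end = j + len(end_tag)
        yield buf[i:record_end]
        buf_offset += record_end
        buf = buf[record_end:]


def read_split_records(data, split_size, start_tag=b"<node", end_tag=b"</node>"):
    """Hypothetical driver: read fixed-size byte splits independently."""
    records = []
    for split_start in range(0, len(data), split_size):
        split_end = min(split_start + split_size, len(data))
        records.extend(read_records(io.BytesIO(data), start_tag, end_tag,
                                    split_start, split_end))
    return records


if __name__ == "__main__":
    # Made-up GraphML-like sample, only for demonstration.
    sample = (b"<graph>"
              + b"".join(b'<node id="n%d"><data>%d</data></node>' % (n, n)
                         for n in range(5))
              + b"</graph>")
    for record in read_split_records(sample, split_size=16):
        print(record.decode("utf-8"))
```

Because a split whose range starts in the middle of a record simply skips
ahead to the next start tag, the split boundaries can fall anywhere in the
file. Note that a prefix such as <node would also match a differently
named element like <nodes>, and self-closing elements would not be
handled, so a real GraphML file may need a more careful matcher.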