Re: convert local tsv file to orc file on distributed cloud storage (openstack).
Have the first path be something like .csv("file:///home/user/dataset/data.csv")

If you're working with files that big:
- don't use the inferSchema option, as that will trigger two scans through the data
- try with a smaller file first, say 1MB or so

Trying to use Spark *or any other tool* to upload a 150GB file to OpenStack is probably doomed. In fact, given how OpenStack handles large files, you are going to be in trouble. Spark will try to upload the file in blocks (size in KB set in spark.hadoop.fs.swift.<service>.partsize), but there's no retry logic in that code by the look of things, so the failure of any single PUT of a block will cause the run to fail. Spark will retry, but you could just hit the same problem again.

Smaller files buy you parallel processing, more resilience to failures, and an easier time breaking things up later on. Really, really try to break things up.

That said, given that one of the problems is that the Hadoop Swift client doesn't do much retrying on failed writes, do the CSV -> ORC + Snappy conversion locally in Spark, then do the upload to Swift using any tools you have to hand, be they command line or GUI. That should at least isolate the upload problem from the conversion.
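To make that concrete: a minimal Scala sketch of the local conversion step (a sketch only; the paths, column names, and partition count below are illustrative assumptions, not part of the original advice). It reads the TSV with an explicit schema instead of inferSchema, and repartitions so the output is many smaller ORC files that can each be uploaded, and retried, independently:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In spark-shell a `spark` session already exists; this is for a standalone job.
val spark = SparkSession.builder()
  .appName("tsv-to-orc")
  .master("local[*]")
  .getOrCreate()

// Hypothetical schema; replace with the real columns of your TSV.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)))

val tsv = spark.read
  .options(Map(
    "sep" -> "\t",          // tab-separated rather than the comma default
    "header" -> "true",
    "mode" -> "FAILFAST"))  // fail early on malformed rows
  .schema(schema)           // explicit schema: one scan, no inferSchema
  .csv("file:///home/user/dataset/data.tsv")

// Around 256 partitions for ~150GB gives ORC parts of a few hundred MB each,
// so a failed upload only needs that one part retried.
tsv.repartition(256)
  .write
  .mode("overwrite")
  .orc("file:///home/user/dataset/orc/")  // snappy is the ORC writer's default codec

Writing many few-hundred-MB parts rather than one 150GB file follows the advice above: each part can fail and be re-uploaded on its own.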
Re: convert local tsv file to orc file on distributed cloud storage (openstack).
Hi,
The source file I have is on the local machine and it's pretty huge, like 150 GB. How to go about it?

On Sun, Nov 20, 2016 at 8:52 AM, Steve Loughran wrote:
>
> On 19 Nov 2016, at 17:21, vr spark wrote:
>
> Hi,
> I am looking for scala or python code samples to convert local tsv file to
> orc file and store on distributed cloud storage (openstack).
>
> So, need these 3 samples. Please suggest.
>
> 1. read tsv
> 2. convert to orc
> 3. store on distributed cloud storage
>
> thanks
> VR
>
>
> all options, 9 lines of code, assuming a spark context has already been
> set up with the permissions to write to AWS, and the relevant JARs for S3A
> to work on the CP. The read operation is inefficient: to determine the
> schema it scans the (here, remote) file twice. That may be OK for an
> example, but I wouldn't do that in production. The source is a real file
> belonging to Amazon; dest a bucket of mine.
>
> More details at: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
>
> val csvdata = spark.read.options(Map(
>   "header" -> "true",
>   "ignoreLeadingWhiteSpace" -> "true",
>   "ignoreTrailingWhiteSpace" -> "true",
>   "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
>   "inferSchema" -> "true",
>   "mode" -> "FAILFAST"))
>   .csv("s3a://landsat-pds/scene_list.gz")
> csvdata.write.mode("overwrite").orc("s3a://hwdev-stevel-demo2/landsatOrc")
Re: convert local tsv file to orc file on distributed cloud storage (openstack).
On 19 Nov 2016, at 17:21, vr spark wrote:

Hi,
I am looking for scala or python code samples to convert local tsv file to orc file and store on distributed cloud storage (openstack).

So, need these 3 samples. Please suggest.

1. read tsv
2. convert to orc
3. store on distributed cloud storage

thanks
VR

all options, 9 lines of code, assuming a spark context has already been set up with the permissions to write to AWS, and the relevant JARs for S3A to work on the CP. The read operation is inefficient: to determine the schema it scans the (here, remote) file twice. That may be OK for an example, but I wouldn't do that in production. The source is a real file belonging to Amazon; dest a bucket of mine.

More details at: http://www.slideshare.net/steve_l/apache-spark-and-object-stores

val csvdata = spark.read.options(Map(
  "header" -> "true",
  "ignoreLeadingWhiteSpace" -> "true",
  "ignoreTrailingWhiteSpace" -> "true",
  "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
  "inferSchema" -> "true",
  "mode" -> "FAILFAST"))
  .csv("s3a://landsat-pds/scene_list.gz")
csvdata.write.mode("overwrite").orc("s3a://hwdev-stevel-demo2/landsatOrc")
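One note on the inferSchema caveat above: the production alternative is to pass an explicit schema, so the file is only scanned once. A sketch of that pattern (the column list here is an illustrative assumption, not the actual layout of scene_list.gz):

import org.apache.spark.sql.types._

// Illustrative subset of columns; an assumption, not the real file layout.
val sceneSchema = StructType(Seq(
  StructField("entityId", StringType, nullable = true),
  StructField("acquisitionDate", TimestampType, nullable = true),
  StructField("cloudCover", DoubleType, nullable = true)))

val csvdata = spark.read
  .options(Map(
    "header" -> "true",
    "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
    "mode" -> "FAILFAST"))
  .schema(sceneSchema)   // single pass over the data: no inferSchema
  .csv("s3a://landsat-pds/scene_list.gz")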
convert local tsv file to orc file on distributed cloud storage (openstack).
Hi,
I am looking for scala or python code samples to convert a local tsv file to an orc file and store it on distributed cloud storage (openstack).

So, need these 3 samples. Please suggest.

1. read tsv
2. convert to orc
3. store on distributed cloud storage

thanks
VR