Re: convert local tsv file to orc file on distributed cloud storage (openstack).

2016-11-25 Thread Steve Loughran
Have the first path be something like

.csv("file:///home/user/dataset/data.csv")

(note the three slashes: the file:// scheme followed by an absolute path).


If you're working with files that big:
 -don't use the inferSchema option, as that will trigger two scans through the
  data; declare the schema up front instead (see the sketch below)
 -try with a smaller file first, say 1MB or so
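
A minimal sketch of that explicit-schema read, assuming Spark 2.x; the column
names and types here are hypothetical, so substitute the real layout of your
TSV:

import org.apache.spark.sql.types._

// Declaring the schema up front avoids the extra inference scan.
// These columns are made up for illustration.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("created", TimestampType, nullable = true)))

val tsvdata = spark.read
  .option("header", "true")
  .option("sep", "\t")        // tab-separated input
  .option("mode", "FAILFAST") // fail early on malformed rows
  .schema(schema)             // explicit schema: one pass over the data
  .csv("file:///home/user/dataset/data.tsv")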

Trying to use spark *or any other tool* to upload a 150GB file to openstack is
probably doomed. In fact, given how openstack handles large files, you are
going to be in trouble. Spark will try to upload the file in blocks (size in
KB, set in spark.hadoop.fs.swift..partsize), but there's no retry logic in that
code by the look of things, so the failure of any single PUT of a block will
cause the run to fail. Spark will retry, but you could just hit the same
problem again. Smaller files mean parallel processing, more resilience to
failures, and an easier time breaking things up later on.
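
If you try the direct upload anyway, a hedged sketch of bumping the part size;
the service name "myprovider" is hypothetical and must match whatever
fs.swift.service.* entries your deployment defines:

// "myprovider" is a made-up Swift service name; use your own.
// The part size is in KB.
spark.sparkContext.hadoopConfiguration
  .set("fs.swift.service.myprovider.partsize", "1024")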

Really, really try to break things up.

That said, given that one of the problems is that the hadoop swift client
doesn't do much retrying on failed writes, do the CSV -> ORC + snappy
conversion locally in Spark, then do the upload to swift using any tools you
have to hand, be they command line or GUI. That should at least isolate the
upload problem from the conversion.
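
A minimal sketch of that local conversion, assuming a Spark build with ORC
support (Hive support enabled on pre-2.3 releases); the paths are hypothetical,
and older releases may want "orc.compress" -> "SNAPPY" in place of the
"compression" option:

// Local TSV in, local ORC + snappy out; upload the output directory
// afterwards with whatever swift client you trust.
val localData = spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(schema)             // explicit schema, as in the earlier sketch
  .csv("file:///home/user/dataset/data.tsv")

localData.write
  .mode("overwrite")
  .option("compression", "snappy")
  .orc("file:///home/user/dataset/orc")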



On 24 Nov 2016, at 18:44, vr spark wrote:

> Hi, the source file I have is on the local machine and it's pretty huge,
> around 150 GB. How to go about it?



Re: convert local tsv file to orc file on distributed cloud storage (openstack).

2016-11-24 Thread vr spark
Hi, the source file I have is on the local machine and it's pretty huge,
around 150 GB. How to go about it?



Re: convert local tsv file to orc file on distributed cloud storage (openstack).

2016-11-20 Thread Steve Loughran


All options, 9 lines of code, assuming a Spark context has already been set up
with the permissions to write to AWS, and the relevant JARs for S3A on the
classpath. The read operation is inefficient: to determine the schema it scans
the (here, remote) file twice. That may be OK for an example, but I wouldn't do
it in production. The source is a real file belonging to Amazon; the
destination is a bucket of mine.

More details at: 
http://www.slideshare.net/steve_l/apache-spark-and-object-stores


val csvdata = spark.read.options(Map(
  "header" -> "true",                                // first row holds column names
  "ignoreLeadingWhiteSpace" -> "true",
  "ignoreTrailingWhiteSpace" -> "true",
  "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ", // timestamp layout used in the file
  "inferSchema" -> "true",                           // triggers the second scan noted above
  "mode" -> "FAILFAST"))                             // fail early on malformed rows
  .csv("s3a://landsat-pds/scene_list.gz")
csvdata.write.mode("overwrite").orc("s3a://hwdev-stevel-demo2/landsatOrc")
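
For the TSV-to-openstack case the question actually asks about, a hedged
variant of the same pattern; "mycontainer" and "myprovider" in the swift://
URL are hypothetical, and the hadoop-openstack connector must be configured
and on the classpath:

val tsvdata = spark.read.options(Map(
  "header" -> "true",
  "sep" -> "\t",              // tab-separated rather than comma-separated
  "inferSchema" -> "true",    // fine for a demo; use an explicit schema at scale
  "mode" -> "FAILFAST"))
  .csv("file:///home/user/dataset/data.tsv")
tsvdata.write.mode("overwrite").orc("swift://mycontainer.myprovider/dataOrc")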


convert local tsv file to orc file on distributed cloud storage (openstack).

2016-11-19 Thread vr spark
Hi,
I am looking for Scala or Python code samples to convert a local tsv file to
an orc file and store it on distributed cloud storage (openstack).

So I need these 3 samples. Please suggest.

1. read tsv
2. convert to orc
3. store on distributed cloud storage


thanks
VR