Re: Spark Read from Google store and save in AWS s3
This should help: https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example

On 8 January 2017 at 03:49, neil90 <neilp1...@icloud.com> wrote the message below.
Re: Spark Read from Google store and save in AWS s3
Here is how you would read from Google Cloud Storage (note: you need to create a service account key first):

os.environ['PYSPARK_SUBMIT_ARGS'] = """--jars /home/neil/Downloads/gcs-connector-latest-hadoop2.jar pyspark-shell"""

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()\
    .setMaster("local[8]")\
    .setAppName("GS")

sc = SparkContext(conf=conf)

sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "PUT YOUR GOOGLE PROJECT ID HERE")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.email", "testa...@sparkgcs.iam.gserviceaccount.com")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.keyfile", "sparkgcs-96bd21691c29.p12")

spark = SparkSession.builder\
    .config(conf=sc.getConf())\
    .getOrCreate()

dfTermRaw = spark.read.format("csv")\
    .option("header", "true")\
    .option("delimiter", "\t")\
    .option("inferSchema", "true")\
    .load("gs://bucket_test/sample.tsv")

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Read-from-Google-store-and-save-in-AWS-s3-tp28278p28286.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
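The snippet above covers only the GCS read; the other half of the thread's goal is the write to S3. A minimal sketch of the s3a side, assuming placeholder credentials and bucket names (none of these values come from the thread):

```python
def s3a_settings(access_key, secret_key):
    """Hadoop configuration entries the s3a connector needs.
    The argument values are placeholders, not real credentials."""
    return {
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }

# With the SparkContext `sc` from the snippet above, the same
# hadoopConfiguration can carry the s3a settings alongside the GCS ones:
#
#   for key, value in s3a_settings("ACCESS_KEY", "SECRET_KEY").items():
#       sc._jsc.hadoopConfiguration().set(key, value)
#
#   dfTermRaw.write.format("csv") \
#       .option("header", "true") \
#       .save("s3a://bucket-on-s3/path2/")
```

Running the commented-out part also requires the hadoop-aws and AWS SDK jars on the classpath, the same way the GCS connector jar is passed above.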
Re: Spark Read from Google store and save in AWS s3
On 5 Jan 2017, at 20:07, Manohar Reddy wrote:

> Hi Steve,
> Thanks for the reply; here is a follow-up question. Do you mean we can set up two native file systems on a single SparkContext, so that based on the URL prefixes (gs://bucket/path and dest s3a://bucket-on-s3/path2) Spark will identify and read/write to the appropriate cloud? Is my understanding right?

I wouldn't use the term "native FS", as they are all just client libraries to talk to the relevant object stores. You'd still have to have the cluster "default" FS. But yes, you can use them: get your classpath right, and they are all just URLs you use in your code.
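To illustrate the point about URL prefixes: Hadoop selects the filesystem client from the URL's scheme, so one job can address both stores. A toy sketch of that lookup, mirroring the config keys discussed in this thread (this is an illustration, not Hadoop's actual resolution code):

```python
from urllib.parse import urlparse

# Filesystem implementations keyed by URL scheme, as wired up via
# "fs.gs.impl" earlier in the thread and by hadoop-aws for s3a.
FS_IMPLS = {
    "gs": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "s3a": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}

def filesystem_for(url):
    """Mimic Hadoop's scheme-based filesystem selection for a URL."""
    scheme = urlparse(url).scheme
    if scheme not in FS_IMPLS:
        raise ValueError("no filesystem configured for scheme %r" % scheme)
    return FS_IMPLS[scheme]
```

So a single job can read `gs://bucket/path` and write `s3a://bucket-on-s3/path2`, with the scheme alone steering each operation to the right client library.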
RE: Spark Read from Google store and save in AWS s3
Hi Steve,

Thanks for the reply; here is a follow-up question. Do you mean we can set up two native file systems on a single SparkContext, so that based on the URL prefixes (gs://bucket/path and dest s3a://bucket-on-s3/path2) Spark will identify and read/write to the appropriate cloud? Is my understanding right?

Manohar

From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Thursday, January 5, 2017 11:05 PM
To: Manohar Reddy
Cc: user@spark.apache.org
Subject: Re: Spark Read from Google store and save in AWS s3
Re: Spark Read from Google store and save in AWS s3
On 5 Jan 2017, at 09:58, Manohar753 wrote:

> Hi All,
> Using Spark, is interoperability between two clouds (Google, AWS) possible? In my use case I need to take Google Cloud Storage as input to Spark, do some processing, and finally store the results in S3; my Spark engine runs on an AWS cluster. Please let me know if there is any way to do this kind of use case directly with Spark, without any middle components, and share any info or links you have.
> Thanks,

I've not played with GCS, and have some noted concerns about test coverage (https://github.com/GoogleCloudPlatform/bigdata-interop/pull/40), but assuming you are not hitting any specific problems, it should be a matter of having the input as gs://bucket/path and the dest as s3a://bucket-on-s3/path2.

You'll need the Google storage JARs on your classpath, along with those needed for s3n/s3a.

1. A little talk on the topic, though I only play with Azure and S3: https://www.youtube.com/watch?v=ND4L_zSDqF0
2. Some notes; bear in mind that the s3a performance tuning covered relates to things surfacing in Hadoop 2.8, which you probably won't have: https://hortonworks.github.io/hdp-aws/s3-spark/

A one-line test for whether s3 is set up: can you read the landsat CSV file?

sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()

This should work from wherever you are, if your classpath and credentials are set up.
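The one-line landsat check above can be wrapped as a small smoke-test helper. Running it needs a live SparkContext plus the hadoop-aws jars and AWS credentials, so the Spark call itself is shown commented out; the helper name is my own, not from the thread:

```python
# Public dataset used in the thread for the classpath/credentials check.
LANDSAT_URL = "s3a://landsat-pds/scene_list.gz"

def s3a_smoke_test(sc):
    """Count lines in the public landsat scene list.
    An exception here usually means the s3a jars or the AWS
    credentials are missing from the Spark environment."""
    return sc.textFile(LANDSAT_URL).count()

# Usage, with an existing SparkContext `sc`:
#   s3a_smoke_test(sc)
```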
Spark Read from Google store and save in AWS s3
Hi All,

Using Spark, is interoperability between two clouds (Google, AWS) possible? In my use case I need to take Google Cloud Storage as input to Spark, do some processing, and finally store the results in S3; my Spark engine runs on an AWS cluster. Please let me know if there is any way to do this kind of use case directly with Spark, without any middle components, and share any info or links you have.

Thanks,

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Read-from-Google-store-and-save-in-AWS-s3-tp28278.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org