I didn’t upload the file. But since the identical Beam code, running in
Spark local mode, was able to fetch and process the file, the file does
exist. It’s just that Spark standalone mode somehow can’t find it.
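
One guess I want to test: in local mode the gs:// handler registration may
happen in the same JVM that reads the file, while in standalone mode the
executor JVMs may never register it. Something along these lines, though I
haven’t verified that these registration calls exist in this exact form in
the Beam version I’m on, so treat it as an untested sketch:

    // Untested sketch: explicitly register the gs:// handler on this JVM.
    // IOChannelUtils.setIOFactory and GcsIOChannelFactory are assumptions
    // on my part -- verify they exist in this form in your Beam version.
    PipelineOptions options = PipelineOptionsFactory.create();
    GcsOptions gcsOptions = options.as(GcsOptions.class);
    IOChannelUtils.setIOFactory("gs", new GcsIOChannelFactory(gcsOptions));

If the failure really happens on the executors, this would of course have to
run there as well, not just on the driver.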


> On Jan 23, 2017, at 4:50 PM, Amit Sela <[email protected]> wrote:
> 
> I think "external" is the key here, you're cluster is running all it's 
> components on your local machine so you're good.
> 
> As for GS, it's like Amazon's S3, a sort of cloud-service HDFS offered by 
> Google. You need to upload your file to GS. Have you?
> 
> On Mon, Jan 23, 2017 at 11:47 PM Chaoran Yu <[email protected]> wrote:
> Well, my file is not in my local filesystem. It’s in GS. 
> This is the line of code that reads the input file: 
> p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
> 
> And this page https://beam.apache.org/get-started/quickstart/ says the 
> following: "you can’t access a local file if you are running the pipeline 
> on an external cluster".
> I’m indeed trying to run a pipeline on a standalone Spark cluster running on 
> my local machine. So local files are not an option.
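> 
> For context, the surrounding setup looks roughly like this (trimmed from 
> memory, so the option values are just what I'm using locally; the read 
> line is the only part quoted verbatim above):
> 
>     SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
>         .withValidation().as(SparkPipelineOptions.class);
>     options.setRunner(SparkRunner.class);
>     options.setSparkMaster("spark://localhost:7077"); // local standalone master
>     Pipeline p = Pipeline.create(options);
>     p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"));
>     p.run();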
> 
> 
>> On Jan 23, 2017, at 4:41 PM, Amit Sela <[email protected]> wrote:
>> 
>> Why not try file:// instead? It doesn't seem like you're using Google 
>> Storage, right? I mean, the input file is on your local FS.
>> 
>> On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu <[email protected]> wrote:
>> No I’m not using Dataproc.
>> I’m simply running on my local machine. I started a local Spark cluster with 
>> sbin/start-master.sh and sbin/start-slave.sh. Then I submitted my Beam job 
>> to that cluster.
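>> 
>> The submit command looks roughly like this (the class and jar names here 
>> are placeholders, not my actual ones):
>> 
>>     spark-submit --class com.example.beam.WordCount \
>>       --master spark://localhost:7077 \
>>       target/word-count-bundled-0.1.jar \
>>       --runner=SparkRunner
>> 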
>> The gs file is the kinglear.txt from Beam’s example code and it should be 
>> public. 
>> 
>> My full stack trace is attached.
>> 
>> Thanks,
>> Chaoran
>> 
>> 
>> 
>>> On Jan 23, 2017, at 4:23 PM, Amit Sela <[email protected]> wrote:
>>> 
>>> Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the 
>>> machines hosting the executor processes have access to GS? Could you 
>>> paste the entire stack trace?
>>> 
>>> On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu <[email protected]> wrote:
>>> Thank you Amit for the reply,
>>> 
>>> I just tried two more runners and below is a summary:
>>> 
>>> DirectRunner: works
>>> FlinkRunner: works in local mode. I got an error “Communication with 
>>> JobManager failed: lost connection to the JobManager” when running in 
>>> cluster mode.
>>> SparkRunner: works in local mode (mvn exec command) but fails in cluster 
>>> mode (spark-submit) with the error I pasted in the previous email.
>>> 
>>> In SparkRunner’s case, can it be that the Spark executors can’t access 
>>> the gs file in Google Cloud Storage?
>>> 
>>> Thank you,
>>> 
>>> 
>>> 
>>>> On Jan 23, 2017, at 3:28 PM, Amit Sela <[email protected]> wrote:
>>>> 
>>>> Is this working for you with other runners? Judging by the stack trace, 
>>>> it seems like IOChannelUtils fails to find a handler, so it doesn't seem 
>>>> like a Spark-specific problem.
>>>> 
>>>> On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu <[email protected]> wrote:
>>>> Thank you Amit and JB! 
>>>> 
>>>> This is not related to DC/OS itself, but I ran into a problem when 
>>>> launching a Spark job on a cluster with spark-submit. My Spark job written 
>>>> in Beam can’t read the specified gs file. I got the following error:
>>>> 
>>>> Caused by: java.io.IOException: Unable to find handler for gs://beam-samples/sample.txt
>>>>    at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
>>>>    at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
>>>>    at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
>>>>    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
>>>> 
>>>> Then I thought about switching to reading from another source, but I saw 
>>>> in Beam’s documentation that TextIO can only read from files in Google 
>>>> Cloud Storage (prefixed with gs://) when running in cluster mode. How do 
>>>> you guys do file IO in Beam when using the SparkRunner?
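>>>> 
>>>> As a sanity check, I may try resolving the factory directly, since 
>>>> IOChannelUtils.getFactory is the call that fails in the trace above. 
>>>> Just a diagnostic sketch, exercising the same lookup the reader does:
>>>> 
>>>>     // Diagnostic sketch: resolve the gs:// handler on a given JVM.
>>>>     // This is the same lookup that fails in the trace above; if the
>>>>     // handler isn't registered there, it throws the same IOException.
>>>>     IOChannelFactory factory =
>>>>         IOChannelUtils.getFactory("gs://beam-samples/sample.txt");
>>>>     System.out.println("gs:// resolved to " + factory.getClass().getName());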
>>>> 
>>>> 
>>>> Thank you,
>>>> Chaoran
>>>> 
>>>> 
>>>>> On Jan 22, 2017, at 4:32 AM, Amit Sela <[email protected]> wrote:
>>>>> 
>>>>> I'll join JB's comment on the Spark runner: submitting Beam pipelines 
>>>>> using the Spark runner can be done with Spark's spark-submit script. 
>>>>> Find out more in the Spark runner documentation: 
>>>>> https://beam.apache.org/documentation/runners/spark/
>>>>> 
>>>>> Amit.
>>>>> 
>>>>> On Sun, Jan 22, 2017 at 8:03 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>> Hi,
>>>>> 
>>>>> Not directly DC/OS (I think Stephen did some tests on it), but I have a
>>>>> platform running Spark and Flink with Beam on Mesos + Marathon.
>>>>> 
>>>>> It basically doesn't have anything special, as running pipelines uses
>>>>> spark-submit (as you would in Spark "natively").
>>>>> 
>>>>> Regards
>>>>> JB
>>>>> 
>>>>> On 01/22/2017 12:56 AM, Chaoran Yu wrote:
>>>>> > Hello all,
>>>>> >
>>>>> > Has anyone had experience using Beam on DC/OS? I want to run Beam code
>>>>> > executed with the Spark runner on DC/OS. As a next step, I would like
>>>>> > to run the Flink runner as well. There doesn't seem to exist any
>>>>> > information about running Beam on DC/OS that I can find on the web. So
>>>>> > some pointers are greatly appreciated.
>>>>> >
>>>>> > Thank you,
>>>>> >
>>>>> > Chaoran Yu
>>>>> >
>>>>> 
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> [email protected] <mailto:[email protected]>
>>>>> http://blog.nanthrax.net <http://blog.nanthrax.net/>
>>>>> Talend - http://www.talend.com <http://www.talend.com/>
>>>> 
>>> 
>> 
> 
