Sorry for the spam. But to clarify, I didn’t write the code. I’m using the code 
described here https://beam.apache.org/get-started/wordcount-example/ 
So the file already exists in GS.

> On Jan 23, 2017, at 4:55 PM, Chaoran Yu <[email protected]> wrote:
> 
> I didn’t upload the file. But since the identical Beam code, when running in 
> Spark local mode, was able to fetch the file and process it, the file does 
> exist.
> It’s just that somehow Spark standalone mode can’t find the file.
> 
> 
>> On Jan 23, 2017, at 4:50 PM, Amit Sela <[email protected]> wrote:
>> 
>> I think "external" is the key here: your cluster is running all its 
>> components on your local machine, so you're good.
>> 
>> As for GS (Google Cloud Storage), it's like Amazon's S3, or sort of a 
>> cloud-service HDFS offered by Google. You need to upload your file to GS. 
>> Have you?
>> 
>> On Mon, Jan 23, 2017 at 11:47 PM Chaoran Yu <[email protected]> wrote:
>> Well, my file is not in my local filesystem. It’s in GS. 
>> This is the line of code that reads the input file: 
>> p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
>> 
>> And this page https://beam.apache.org/get-started/quickstart/ says the following:
>> “you can’t access a local file if you are running the pipeline on an 
>> external cluster”.
>> I’m indeed trying to run a pipeline on a standalone Spark cluster running on 
>> my local machine. So local files are not an option.
>> 
>> 
>>> On Jan 23, 2017, at 4:41 PM, Amit Sela <[email protected]> wrote:
>>> 
>>> Why not try file:// instead? It doesn't seem like you're using Google 
>>> Storage, right? I mean, the input file is on your local FS.
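[For reference, the file:// variant of the read line quoted later in the thread would look something like the sketch below. This mirrors the pre-1.0 Beam `TextIO.Read` API used elsewhere in this thread; the local path is a hypothetical placeholder, not one taken from the discussion.]

```java
// Read from the local filesystem instead of Google Cloud Storage.
// Caveat from the Beam quickstart: on a multi-machine cluster, every
// worker would need this path to exist locally, which is why local
// files are discouraged when running on an external cluster.
p.apply(TextIO.Read.from("file:///tmp/kinglear.txt"));
```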
>>> 
>>> On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu <[email protected]> wrote:
>>> No I’m not using Dataproc.
>>> I’m simply running on my local machine. I started a local Spark cluster 
>>> with sbin/start-master.sh and sbin/start-slave.sh. Then I submitted my Beam 
>>> job to that cluster.
>>> The gs file is the kinglear.txt from Beam’s example code and it should be 
>>> public. 
>>> 
>>> My full stack trace is attached.
>>> 
>>> Thanks,
>>> Chaoran
>>> 
>>> 
>>> 
>>>> On Jan 23, 2017, at 4:23 PM, Amit Sela <[email protected]> wrote:
>>>> 
>>>> Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the 
>>>> machines hosting the executor processes have access to GS? Could you 
>>>> paste the entire stack trace?
>>>> 
>>>> On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu <[email protected]> wrote:
>>>> Thank you Amit for the reply,
>>>> 
>>>> I just tried two more runners and below is a summary:
>>>> 
>>>> DirectRunner: works
>>>> FlinkRunner: works in local mode. I got an error “Communication with 
>>>> JobManager failed: lost connection to the JobManager” when running in 
>>>> cluster mode.
>>>> SparkRunner: works in local mode (mvn exec command) but fails in cluster 
>>>> mode (spark-submit) with the error I pasted in the previous email.
>>>> 
>>>> In SparkRunner’s case, can it be that the Spark executor can’t access the 
>>>> gs file in Google Storage?
>>>> 
>>>> Thank you,
>>>> 
>>>> 
>>>> 
>>>>> On Jan 23, 2017, at 3:28 PM, Amit Sela <[email protected]> wrote:
>>>>> 
>>>>> Is this working for you with other runners? Judging by the stack trace, 
>>>>> it seems like IOChannelUtils fails to find a handler, so it doesn't seem 
>>>>> like a Spark-specific problem.
>>>>> 
>>>>> On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu <[email protected]> wrote:
>>>>> Thank you Amit and JB! 
>>>>> 
>>>>> This is not related to DC/OS itself, but I ran into a problem when 
>>>>> launching a Spark job on a cluster with spark-submit. My Spark job 
>>>>> written in Beam can’t read the specified gs file. I got the following 
>>>>> error:
>>>>> 
>>>>> Caused by: java.io.IOException: Unable to find handler for 
>>>>> gs://beam-samples/sample.txt
>>>>>   at 
>>>>> org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
>>>>>   at 
>>>>> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
>>>>>   at 
>>>>> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
>>>>>   at 
>>>>> org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
>>>>> 
>>>>> Then I thought about switching to reading from another source, but I saw 
>>>>> in Beam’s documentation that TextIO can only read from files in Google 
>>>>> Cloud Storage (prefixed with gs://) when running in cluster mode. How are 
>>>>> you doing file IO in Beam when using the SparkRunner?
>>>>> 
>>>>> 
>>>>> Thank you,
>>>>> Chaoran
>>>>> 
>>>>> 
>>>>>> On Jan 22, 2017, at 4:32 AM, Amit Sela <[email protected]> wrote:
>>>>>> 
>>>>>> I'll join JB's comment on the Spark runner: submitting Beam pipelines 
>>>>>> using the Spark runner can be done with Spark's spark-submit script. 
>>>>>> Find out more in the Spark runner documentation 
>>>>>> <https://beam.apache.org/documentation/runners/spark/>.
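[For concreteness, a spark-submit invocation along the lines discussed in this thread might look like the following. The jar name, main class, master URL, and pipeline options are illustrative placeholders, not values taken from the thread.]

```shell
# Submit a Beam pipeline (packaged as a bundled/fat jar) to a
# standalone Spark master started with sbin/start-master.sh.
spark-submit \
  --class org.example.WordCount \
  --master spark://localhost:7077 \
  target/word-count-bundled-0.1.jar \
  --runner=SparkRunner \
  --inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt \
  --output=counts
```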
>>>>>> 
>>>>>> Amit.
>>>>>> 
>>>>>> On Sun, Jan 22, 2017 at 8:03 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Not directly DCOS (I think Stephen did some test on it), but I have a
>>>>>> platform running Spark and Flink with Beam on Mesos + Marathon.
>>>>>> 
>>>>>> It basically doesn't require anything special, as running pipelines uses 
>>>>>> spark-submit (as in Spark "natively").
>>>>>> 
>>>>>> Regards
>>>>>> JB
>>>>>> 
>>>>>> On 01/22/2017 12:56 AM, Chaoran Yu wrote:
>>>>>> > Hello all,
>>>>>> >
>>>>>> >   Has anyone had experience using Beam on DC/OS? I want to run Beam
>>>>>> > code executed with the Spark runner on DC/OS. As a next step, I would
>>>>>> > like to run the Flink runner as well. There doesn't seem to be any
>>>>>> > information about running Beam on DC/OS that I can find on the web,
>>>>>> > so some pointers are greatly appreciated.
>>>>>> >
>>>>>> > Thank you,
>>>>>> >
>>>>>> > Chaoran Yu
>>>>>> >
>>>>>> 
>>>>>> --
>>>>>> Jean-Baptiste Onofré
>>>>>> [email protected]
>>>>>> http://blog.nanthrax.net
>>>>>> Talend - http://www.talend.com
>>>>> 
>>>> 
>>> 
>> 
> 
