Well, my file is not on my local filesystem; it’s in Google Cloud Storage (GCS).
This is the line of code that reads the input file:
p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
And this page, https://beam.apache.org/get-started/quickstart/, says the
following: “you can’t access a local file if you are running the pipeline on an
external cluster”.
I’m indeed trying to run a pipeline on a standalone Spark cluster running on my
local machine. So local files are not an option.
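For completeness, here is roughly how I build and submit the job. The jar name, main class, and master URL below are placeholders, not the actual ones from my project:

```shell
# Placeholder names: org.example.WordCount, word-count-bundled-0.1.jar and
# spark://localhost:7077 are assumptions, not values from this thread.
# Build a single fat/shaded jar so the executors see the same classpath
# (Beam SDK + Spark runner + IO modules) that worked in local mode.
mvn clean package -Pspark-runner -DskipTests

# Submit to the standalone master started by sbin/start-master.sh.
spark-submit \
  --class org.example.WordCount \
  --master spark://localhost:7077 \
  target/word-count-bundled-0.1.jar \
  --runner=SparkRunner \
  --inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt \
  --output=/tmp/counts
```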
> On Jan 23, 2017, at 4:41 PM, Amit Sela <[email protected]> wrote:
>
> Why not try file:// instead? It doesn't seem like you're using Google
> Storage, right? I mean, the input file is on your local FS.
>
> On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu <[email protected]> wrote:
> No, I’m not using Dataproc.
> I’m simply running on my local machine. I started a local Spark cluster with
> sbin/start-master.sh and sbin/start-slave.sh, then submitted my Beam job to
> that cluster.
> The gs:// file is kinglear.txt from Beam’s example code, and it should be
> public.
>
> My full stack trace is attached.
>
> Thanks,
> Chaoran
>
>
>
>> On Jan 23, 2017, at 4:23 PM, Amit Sela <[email protected]> wrote:
>>
>> Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the
>> machines hosting the executor processes have access to GS? Could you paste
>> the entire stack trace?
>>
>> On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu <[email protected]> wrote:
>> Thank you Amit for the reply,
>>
>> I just tried two more runners; below is a summary:
>>
>> DirectRunner: works.
>> FlinkRunner: works in local mode; in cluster mode it fails with
>> “Communication with JobManager failed: lost connection to the JobManager”.
>> SparkRunner: works in local mode (the mvn exec command) but fails in cluster
>> mode (spark-submit) with the error I pasted in the previous email.
>>
>> In SparkRunner’s case, could it be that the Spark executors can’t access the
>> gs:// file in Google Storage?
>>
>> Thank you,
>>
>>
>>
>>> On Jan 23, 2017, at 3:28 PM, Amit Sela <[email protected]> wrote:
>>>
>>> Is this working for you with other runners? Judging by the stack trace,
>>> IOChannelUtils fails to find a handler, so it doesn't seem to be a
>>> Spark-specific problem.
>>>
>>> On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu <[email protected]> wrote:
>>> Thank you Amit and JB!
>>>
>>> This is not related to DC/OS itself, but I ran into a problem when
>>> launching a Spark job on a cluster with spark-submit. My Spark job written
>>> in Beam can’t read the specified gs:// file. I got the following error:
>>>
>>> Caused by: java.io.IOException: Unable to find handler for gs://beam-samples/sample.txt
>>>     at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
>>>     at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
>>>     at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
>>>     at org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
>>>
>>> Then I thought about switching to another source, but I saw in Beam’s
>>> documentation that TextIO can only read from files in Google Cloud Storage
>>> (prefixed with gs://) when running in cluster mode. How do you do file IO
>>> in Beam when using the SparkRunner?
>>>
>>>
>>> Thank you,
>>> Chaoran
>>>
>>>
>>>> On Jan 22, 2017, at 4:32 AM, Amit Sela <[email protected]> wrote:
>>>>
>>>> I'll join JB's comment on the Spark runner: Beam pipelines can be
>>>> submitted via Spark's spark-submit script. Find out more in the Spark
>>>> runner documentation
>>>> <https://beam.apache.org/documentation/runners/spark/>.
>>>>
>>>> Amit.
>>>>
>>>> On Sun, Jan 22, 2017 at 8:03 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> Not directly DCOS (I think Stephen did some test on it), but I have a
>>>> platform running Spark and Flink with Beam on Mesos + Marathon.
>>>>
>>>> It basically doesn't involve anything special, as running pipelines uses
>>>> spark-submit (just as in "native" Spark).
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 01/22/2017 12:56 AM, Chaoran Yu wrote:
>>>> > Hello all,
>>>> >
>>>> > Has anyone had experience using Beam on DC/OS? I want to run Beam code
>>>> > executed with the Spark runner on DC/OS. As a next step, I would like to
>>>> > run the Flink runner as well. There doesn't seem to be any information
>>>> > about running Beam on DC/OS that I can find on the web, so some pointers
>>>> > are greatly appreciated.
>>>> >
>>>> > Thank you,
>>>> >
>>>> > Chaoran Yu
>>>> >
>>>>
>>>> --
>>>> Jean-Baptiste Onofré
>>>> [email protected]
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>
>>
>