TextSocketMicroBatchReader no longer supports nc utility

2018-06-03 Thread Jungtaek Lim
Hi devs,

Not sure I'll hear back soon since Spark Summit is just around the
corner, but I just wanted to post this and wait.

While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits
before reading the actual data, so the query also exits with an error.

The reason is that a temporary reader is launched to read the schema, then
closed, and a new reader is opened afterward. While a robust socket server
should be able to handle this without any issue, nc normally can't handle
multiple connections and simply exits when the temporary reader closes its
connection.

I would like to file an issue and contribute a fix if we think this is a
bug (otherwise we need to replace the nc utility with another one, maybe
our own implementation?), but I'm not sure we'd be happy applying a
workaround for a specific source.
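If we do end up recommending a replacement for nc, a minimal stand-in only needs to keep accepting connections so that the temporary schema reader and the query's real reader both succeed. A rough sketch (hypothetical helper names, not a proposed patch):

```python
import socket
import threading

def start_line_server(lines=("hello", "world"), max_conns=2):
    """Toy stand-in for `nc -lk`: unlike plain nc, it keeps accepting
    connections after the first peer disconnects, so Spark's temporary
    schema reader and the query's real reader can both connect."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))  # pick any free port
    srv.listen()
    port = srv.getsockname()[1]

    def run():
        for _ in range(max_conns):  # a real server would loop forever
            conn, _addr = srv.accept()
            with conn:
                try:
                    conn.sendall(("\n".join(lines) + "\n").encode())
                except OSError:
                    pass  # peer hung up early; keep serving
        srv.close()

    threading.Thread(target=run, daemon=True).start()
    return port

def read_first_line(port):
    """Connect, read one line, disconnect -- roughly what the temporary
    schema reader does before the real reader reconnects."""
    with socket.create_connection(("127.0.0.1", port)) as c:
        return c.makefile().readline().strip()
```

With plain nc, the second connection would fail because nc exits once the first one closes; here both reads succeed.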

Would like to hear opinions before giving it a shot.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-03 Thread Hyukjin Kwon
+1

On Sun, 3 Jun 2018 at 21:25, Ricardo Almeida wrote:

> +1 (non-binding)
>
> On 3 June 2018 at 09:23, Dongjoon Hyun  wrote:
>
>> +1
>>
>> Bests,
>> Dongjoon.
>>
>> On Sat, Jun 2, 2018 at 8:09 PM, Denny Lee  wrote:
>>
>>> +1
>>>
>>> On Sat, Jun 2, 2018 at 4:53 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I'll give that a try, but I'll still have to figure out what to do if
 none of the release builds work with hadoop-aws, since Flintrock deploys
 Spark release builds to set up a cluster. Building Spark is slow, so we
 only do it if the user specifically requests a Spark version by git hash.
 (This is basically how spark-ec2 did things, too.)


 On Sat, Jun 2, 2018 at 6:54 PM Marcelo Vanzin 
 wrote:

> If you're building your own Spark, definitely try the hadoop-cloud
> profile. Then you don't even need to pull anything at runtime,
> everything is already packaged with Spark.
>
> On Fri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas
>  wrote:
> > pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work
> for me
> > either (even building with -Phadoop-2.7). I guess I’ve been relying
> on an
> > unsupported pattern and will need to figure something else out going
> forward
> > in order to use s3a://.
> >
> >
> > On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin 
> wrote:
> >>
> >> I have personally never tried to include hadoop-aws that way. But at
> >> the very least, I'd try to use the same version of Hadoop as the
> Spark
> >> build (2.7.3 IIRC). I don't really expect a different version to
> work,
> >> and if it did in the past it definitely was not by design.
> >>
> >> On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas
> >>  wrote:
> >> > Building with -Phadoop-2.7 didn’t help, and if I remember
> correctly,
> >> > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0
> release,
> >> > so
> >> > it appears something has changed since then.
> >> >
> >> > I wasn’t familiar with -Phadoop-cloud, but I can try that.
> >> >
> >> > My goal here is simply to confirm that this release of Spark
> works with
> >> > hadoop-aws like past releases did, particularly for Flintrock
> users who
> >> > use
> >> > Spark with S3A.
> >> >
> >> > We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop
> builds
> >> > with
> >> > every Spark release. If the -hadoop2.7 release build won’t work
> with
> >> > hadoop-aws anymore, are there plans to provide a new build type
> that
> >> > will?
> >> >
> >> > Apologies if the question is poorly formed. I’m batting a bit
> outside my
> >> > league here. Again, my goal is simply to confirm that I/my users
> still
> >> > have
> >> > a way to use s3a://. In the past, that way was simply to call
> pyspark
> >> > --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very
> similar.
> >> > If
> >> > that will no longer work, I’m trying to confirm that the change of
> >> > behavior
> >> > is intentional or acceptable (as a review for the Spark project)
> and
> >> > figure
> >> > out what I need to change (as due diligence for Flintrock’s
> users).
> >> >
> >> > Nick
> >> >
> >> >
> >> > On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin <
> van...@cloudera.com>
> >> > wrote:
> >> >>
> >> >> Using the hadoop-aws package is probably going to be a little
> more
> >> >> complicated than that. The best bet is to use a custom build of
> Spark
> >> >> that includes it (use -Phadoop-cloud). Otherwise you're probably
> >> >> looking at some nasty dependency issues, especially if you end up
> >> >> mixing different versions of Hadoop.
> >> >>
> >> >> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
> >> >>  wrote:
> >> >> > I was able to successfully launch a Spark cluster on EC2 at
> 2.3.1 RC4
> >> >> > using
> >> >> > Flintrock. However, trying to load the hadoop-aws package gave
> me
> >> >> > some
> >> >> > errors.
> >> >> >
> >> >> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
> >> >> >
> >> >> > 
> >> >> >
> >> >> > :: problems summary ::
> >> >> >  WARNINGS
> >> >> > [NOT FOUND  ] com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
> >> >> >  local-m2-cache: tried
> >> >> > file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
> >> >> > [NOT FOUND  ] com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
> >> >> >  local-m2-cache: tried
> 
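For what it's worth, the version-matching advice in the thread above (use the same hadoop-aws version as the Hadoop version Spark was built against) can be expressed as a trivial sanity check. This is an illustrative sketch only; the helper names are made up and this is not a Spark or Hadoop API:

```python
def parse_maven_coordinate(coord):
    """Split 'org.apache.hadoop:hadoop-aws:2.7.3' into its three parts."""
    group, artifact, version = coord.split(":")
    return group, artifact, version

def matches_spark_hadoop(coord, spark_hadoop_version):
    """True when the hadoop-aws version equals the Hadoop version the
    Spark build ships with (e.g. 2.7.3 for a -Phadoop-2.7 build).
    A mismatch, such as hadoop-aws 2.8.4 against a 2.7.3 build, tends
    to drag in conflicting transitive dependencies."""
    _group, artifact, version = parse_maven_coordinate(coord)
    return artifact == "hadoop-aws" and version == spark_hadoop_version
```

So for a -Phadoop-2.7 build, `org.apache.hadoop:hadoop-aws:2.7.3` passes the check while `2.8.4` does not.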

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-03 Thread Weichen Xu
+1

On Fri, Jun 1, 2018 at 3:41 PM, Xiao Li  wrote:

> +1
>
> 2018-06-01 15:41 GMT-07:00 Xingbo Jiang :
>
>> +1
>>
>> 2018-06-01 9:21 GMT-07:00 Xiangrui Meng :
>>
>>> Hi all,
>>>
>>> I want to call for a vote on SPARK-24374. It introduces a
>>> new execution mode to Spark, which would help both integration with
>>> external DL/AI frameworks and MLlib algorithm performance. This is one of
>>> the follow-ups from a previous discussion on dev@.
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Best,
>>> Xiangrui
>>> --
>>>
>>> Xiangrui Meng
>>>
>>> Software Engineer
>>>
>>> Databricks Inc. (http://databricks.com)
>>>
>>
>>
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-03 Thread Ricardo Almeida
+1 (non-binding)

On 3 June 2018 at 09:23, Dongjoon Hyun  wrote:

> +1
>
> Bests,
> Dongjoon.
Re: Revisiting Online serving of Spark models?

2018-06-03 Thread Holden Karau
On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
maximilianofel...@gmail.com> wrote:

> Hi!
>
> We're already in San Francisco waiting for the summit. We even think that
> we spotted @holdenk this afternoon.
>
Unless you happened to be walking by my garage, probably not super likely;
I spent the day working on scooters/motorcycles (my style is a little less
unique in SF :)). Also, if you see me, feel free to say hi (unless I look
like I haven't had my first coffee of the day); I love chatting with folks IRL :)

>
> @chris, we're really interested in the Meetup you're hosting. My team will
> probably join it from the beginning if you have room for us, and I'll join
> it later after discussing the topics on this thread. I'll send you an email
> regarding this request.
>
> Thanks
>
> On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal wrote:
>
>> @Chris This sounds fantastic, please send summary notes for Seattle folks
>>
>> @Felix I work in downtown Seattle; I'm wondering if we should hold a tech
>> meetup around model serving in Spark at my workplace or somewhere else
>> close by. Thoughts? I'm actually in the midst of building microservices to
>> manage models, and when I say models I mean much more than machine learning
>> models (think OR and process models as well)
>>
>> Regards
>>
>> Sent from my iPhone
>>
>> On May 31, 2018, at 10:32 PM, Chris Fregly  wrote:
>>
>> Hey everyone!
>>
>> @Felix:  thanks for putting this together.  i sent some of you a quick
>> calendar event - mostly for me, so i don’t forget!  :)
>>
>> Coincidentally, this is the focus of the *Advanced Spark and TensorFlow
>> Meetup* @5:30pm on June 6th (same night) here in SF!
>>
>> Everybody is welcome to come.  Here’s the link to the meetup that
>> includes the signup link:
>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>> 
>>
>> We have an awesome lineup of speakers covering a lot of deep, technical
>> ground.
>>
>> For those who can’t attend in person, we’ll be broadcasting live - and
>> posting the recording afterward.
>>
>> All details are in the meetup link above…
>>
>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>> welcome to give a talk. I can move things around to make room.
>>
>> @joseph:  I’d personally like an update on the direction of the
>> Databricks proprietary ML Serving export format which is similar to PMML
>> but not a standard in any way.
>>
>> Also, the Databricks ML Serving Runtime is only available to Databricks
>> customers.  This seems in conflict with the community efforts described
>> here.  Can you comment on behalf of Databricks?
>>
>> Look forward to your response, joseph.
>>
>> See you all soon!
>>
>> —
>>
>>
>> *Chris Fregly *Founder @ *PipelineAI*  (100,000
>> Users)
>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>  (85,000
>> Global Members)
>>
>>
>>
>> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf *
>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>> *
>>
>>
>> On May 30, 2018, at 9:32 AM, Felix Cheung 
>> wrote:
>>
>> Hi!
>>
>> Thank you! Let’s meet then
>>
>> June 6 4pm
>>
>> Moscone West Convention Center
>> 800 Howard Street, San Francisco, CA 94103
>> 
>>
>> Ground floor (outside of conference area - should be available for all) -
>> we will meet and decide where to go
>>
>> (Would not send invite because that would be too much noise for dev@)
>>
>> To paraphrase Joseph, we will use this to kick off the discussion, post
>> notes afterward, and follow up online. As for Seattle, I would be very
>> interested to meet in person later and discuss ;)
>>
>>
>> _
>> From: Saikat Kanjilal 
>> Sent: Tuesday, May 29, 2018 11:46 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Maximiliano Felice 
>> Cc: Felix Cheung , Holden Karau <
>> hol...@pigscanfly.ca>, Joseph Bradley , Leif
>> Walsh , dev 
>>
>>
>> Would love to join but am in Seattle, thoughts on how to make this work?
>>
>> Regards
>>
>> Sent from my iPhone
>>
>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>> maximilianofel...@gmail.com> wrote:
>>
>> Big +1 to a meeting with fresh air.
>>
>> Could anyone send the invites? I don't really know which is the place
>> Holden is talking about.
>>
>> 2018-05-29 14:27 GMT-03:00 Felix Cheung :
>>
>>> You had me at blue bottle!
>>>
>>> _
>>> From: Holden Karau 
>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Felix Cheung 
>>> Cc: Saikat Kanjilal , Maximiliano Felice <
>>> maximilianofel...@gmail.com>, Joseph Bradley ,

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-03 Thread Dongjoon Hyun
+1

Bests,
Dongjoon.

On Sat, Jun 2, 2018 at 8:09 PM, Denny Lee  wrote:

> +1