Re: queryable state & streaming

2017-12-08 Thread Michael Armbrust
https://issues.apache.org/jira/browse/SPARK-16738

I don't believe anyone is working on it yet.  I think the most useful thing
is to start enumerating requirements and use cases and then we can talk
about how to build it.

On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
st.kontopou...@gmail.com> wrote:

> Cool, Burak. Do you have a pointer? Should I take the initiative on a first
> design document, or is Databricks working on it?
>
> Best,
> Stavros
>
> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz  wrote:
>
>> Hi Stavros,
>>
>> Queryable state is definitely on the roadmap! We will revamp the
>> StateStore API a bit, and a queryable StateStore is definitely one of the
>> things we are thinking about during that revamp.
>>
>> Best,
>> Burak
>>
>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" 
>> wrote:
>>
>>> Just to rephrase my question: Would queryable state make a viable
>>> SPIP?
>>>
>>> Regards,
>>> Stavros
>>>
>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>> st.kontopou...@gmail.com> wrote:
>>>
 Hi,

 Maybe this has been discussed before. Given that many streaming apps
 out there use state extensively, could it be a good idea to have Spark
 expose streaming state through an external API, as other systems do
 (Kafka Streams, Flink, etc.), in order to facilitate interactive
 queries?

 Regards,
 Stavros

>>>
>>>
>


Re: queryable state & streaming

2017-12-08 Thread Burak Yavuz
Hi Stavros,

Queryable state is definitely on the roadmap! We will revamp the StateStore
API a bit, and a queryable StateStore is definitely one of the things we
are thinking about during that revamp.
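
For concreteness, here is a purely hypothetical sketch of the kind of
read-only surface such a revamp might expose (every name and signature
below is made up for illustration; this is not a proposed API):

import org.apache.spark.sql.Row

// Hypothetical only: a read-only view over a streaming aggregation's
// state, keyed by the grouping key.
trait QueryableStateStore {
  // Point lookup of the current state row for a grouping key.
  def get(key: Row): Option[Row]
  // Full scan over all (key, state) pairs held by this store.
  def iterator(): Iterator[(Row, Row)]
}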

Best,
Burak

On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" 
wrote:

> Just to rephrase my question: Would queryable state make a viable SPIP?
>
> Regards,
> Stavros
>
> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
> st.kontopou...@gmail.com> wrote:
>
>> Hi,
>>
>> Maybe this has been discussed before. Given that many streaming apps out
>> there use state extensively, could it be a good idea to have Spark expose
>> streaming state through an external API, as other systems do (Kafka
>> Streams, Flink, etc.), in order to facilitate interactive queries?
>>
>> Regards,
>> Stavros
>>
>
>


Re: queryable state & streaming

2017-12-08 Thread Stavros Kontopoulos
Just to rephrase my question: Would queryable state make a viable SPIP?
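
For concreteness, Kafka Streams already exposes this kind of thing as
"interactive queries": a point lookup against a named state store on a
running application. A minimal sketch (it assumes a running KafkaStreams
instance named streams with a store called "word-counts"; the names are
illustrative):

import org.apache.kafka.streams.state.QueryableStoreTypes

// Read-only handle onto the application's local state store.
val store = streams.store("word-counts",
  QueryableStoreTypes.keyValueStore[String, java.lang.Long]())
// Current count for the key "spark"; null if the key is absent.
val count = store.get("spark")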

Regards,
Stavros

On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
st.kontopou...@gmail.com> wrote:

> Hi,
>
> Maybe this has been discussed before. Given that many streaming apps out
> there use state extensively, could it be a good idea to have Spark expose
> streaming state through an external API, as other systems do (Kafka
> Streams, Flink, etc.), in order to facilitate interactive queries?
>
> Regards,
> Stavros
>


Re: BUILD FAILURE due to...not found: value AnalysisBarrier in spark-catalyst_2.11?

2017-12-08 Thread Sean Owen
Build is fine for me, and on Jenkins. Try a clean build?
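From the repo root that would be something like: ./build/mvn -DskipTests
clean package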

On Fri, Dec 8, 2017 at 11:04 AM Jacek Laskowski  wrote:

> Hi,
>
> Just got a BUILD FAILURE and have been wondering whether it's just me or
> a known issue that's being worked on.
>
> (Sorry if it's just my local setup that I broke.)
>
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
> spark-catalyst_2.11 ---
> [INFO] Using zinc server for incremental compilation
> [warn] Pruning sources from previous analysis, due to incompatible
> CompileSetup.
> [info] Compiling 222 Scala sources and 27 Java sources to
> /Users/jacek/dev/oss/spark/sql/catalyst/target/scala-2.11/classes...
> [error]
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:728:
> not found: value AnalysisBarrier
> [error]   AnalysisBarrier(newRight)
> [error]   ^
> [error]
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1081:
> not found: value AnalysisBarrier
> [error]   case sa @ Sort(_, _, AnalysisBarrier(child: Aggregate)) => sa
> [error]^
> [error]
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1134:
> not found: value AnalysisBarrier
> [error] return AnalysisBarrier(plan)
> [error]^
> [error]
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1409:
> not found: value AnalysisBarrier
> [error]   case filter @ Filter(havingCondition,
> AnalysisBarrier(aggregate: Aggregate)) =>
> [error] ^
> [error]
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1410:
> not found: value AnalysisBarrier
> [error] apply(Filter(havingCondition,
> aggregate)).mapChildren(AnalysisBarrier)
> [error]   ^
> [error]
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1470:
> not found: value AnalysisBarrier
> [error]   case sort @ Sort(sortOrder, global,
> AnalysisBarrier(aggregate: Aggregate)) =>
> [error]   ^
> [error]
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1471:
> not found: value AnalysisBarrier
> [error] apply(Sort(sortOrder, global,
> aggregate)).mapChildren(AnalysisBarrier)
> [error]   ^
> [error]
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:2345:
> not found: value AnalysisBarrier
> [error] case AnalysisBarrier(child) => child
> [error]  ^
> [error] 8 errors found
> [error] Compile failed at Dec 8, 2017 5:58:10 PM [8.170s]
> [INFO]
> 
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>


Re: Leveraging S3 select

2017-12-08 Thread Andrew Duffy
Hey Steve,

Do you happen to have a link to the TPC-DS benchmark data with random S3
reads? I've done a decent amount of digging, but all I've found is a reference
in a slide deck and some JIRA tickets.

From: Steve Loughran 
Date: Tuesday, December 5, 2017 at 09:44
To: "Lalwani, Jayesh" 
Cc: Apache Spark Dev 
Subject: Re: Leveraging S3 select




On 29 Nov 2017, at 21:45, Lalwani, Jayesh wrote:

AWS announced at re:Invent that they are launching S3 Select. This can allow
Spark to push down predicates to S3, rather than reading the entire file into
memory. Are there any plans to update Spark to use S3 Select?


  1.  ORC and Parquet don't read the whole file into memory anyway, except in
the special case that the file is gzipped.
  2.  Hadoop's s3a <= 2.7 doesn't handle the aggressive seeks of those columnar
formats that well, as it does a GET from the current position to EOF and has to
abort the TCP connection if the seek is backwards.
  3.  Hadoop 2.8+ with spark.hadoop.fs.s3a.experimental.fadvise=random switches
to random IO and only does smaller GET reads of the data requested (actually
min(min-read-length, buffer-size)). This delivers a ~3x performance boost in
TPC-DS benchmarks (see the sketch below).
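
As an illustration, that switch can be set when building the Spark session
(a minimal sketch; it assumes Hadoop 2.8+ s3a binaries on the classpath):

import org.apache.spark.sql.SparkSession

// Put s3a into random-IO mode so columnar seeks turn into small ranged
// GETs instead of GET-to-EOF reads.
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()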


I don't yet know how much more efficient the new mechanism will be against
columnar data, given those facts. You'd need to do experiments.

The place to implement this would be through predicate pushdown from the file
format to the FS. ORC & Parquet support predicate pushdown, so they'd need to
recognise when the underlying store could do some of the work for them, open
the store input stream differently, and use a whole new (undefined?) API for
the queries. Most likely: s3a would add a way to specify a predicate to select
on in open(), as well as the expected file type. This would need the underlying
mechanism to also support those formats though, which the announcement doesn't.
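
Purely as a thought experiment (nothing like this exists in s3a today, and
every name below is invented), the shape could be something like:

import org.apache.hadoop.fs.{FSDataInputStream, Path}

// Hypothetical sketch, NOT a real Hadoop/s3a API: an open() variant that
// carries a selection for S3 Select to evaluate server-side.
val in: FSDataInputStream = fs.open(
  new Path("s3a://bucket/logs/part-00000.csv"),
  SelectOptions.builder()          // invented options type
    .format("csv")                 // the expected file type
    .predicate("s._3 > 100")       // pushed down to S3 Select
    .build())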

Someone could do something more immediately through some modified CSV data
source which did the pushdown. However, if you are using CSV for your datasets,
there's something fundamental w.r.t. your data storage policy you need to look
at. It works sometimes as an exchange format, though I prefer Avro there due to
its schemas and support for more complex structures. As a format you run
queries over? No.


BUILD FAILURE due to...not found: value AnalysisBarrier in spark-catalyst_2.11?

2017-12-08 Thread Jacek Laskowski
Hi,

Just got a BUILD FAILURE and have been wondering whether it's just me or a
known issue that's being worked on.

(Sorry if it's just my local setup that I broke.)

[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-catalyst_2.11 ---
[INFO] Using zinc server for incremental compilation
[warn] Pruning sources from previous analysis, due to incompatible
CompileSetup.
[info] Compiling 222 Scala sources and 27 Java sources to
/Users/jacek/dev/oss/spark/sql/catalyst/target/scala-2.11/classes...
[error]
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:728:
not found: value AnalysisBarrier
[error]   AnalysisBarrier(newRight)
[error]   ^
[error]
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1081:
not found: value AnalysisBarrier
[error]   case sa @ Sort(_, _, AnalysisBarrier(child: Aggregate)) => sa
[error]^
[error]
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1134:
not found: value AnalysisBarrier
[error] return AnalysisBarrier(plan)
[error]^
[error]
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1409:
not found: value AnalysisBarrier
[error]   case filter @ Filter(havingCondition,
AnalysisBarrier(aggregate: Aggregate)) =>
[error] ^
[error]
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1410:
not found: value AnalysisBarrier
[error] apply(Filter(havingCondition,
aggregate)).mapChildren(AnalysisBarrier)
[error]   ^
[error]
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1470:
not found: value AnalysisBarrier
[error]   case sort @ Sort(sortOrder, global,
AnalysisBarrier(aggregate: Aggregate)) =>
[error]   ^
[error]
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1471:
not found: value AnalysisBarrier
[error] apply(Sort(sortOrder, global,
aggregate)).mapChildren(AnalysisBarrier)
[error]   ^
[error]
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:2345:
not found: value AnalysisBarrier
[error] case AnalysisBarrier(child) => child
[error]  ^
[error] 8 errors found
[error] Compile failed at Dec 8, 2017 5:58:10 PM [8.170s]
[INFO]


Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


Deprecating UserDefinedGenerator logical operator?

2017-12-08 Thread Jacek Laskowski
Hi,

I've just run across the UserDefinedGenerator logical operator [1], which is
used exclusively by the Dataset.explode operator [2][3] that is...

@deprecated("use flatMap() or select() with functions.explode() instead",
"2.0.0")

Could that also "trigger" deprecating UserDefinedGenerator?

[1]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L90

[2]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2051
[3]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2092

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski