Re: [SparkSQL] Count Distinct issue

2018-09-17 Thread kathleen li
Hi,
I can't reproduce your issue:

scala> spark.sql("select distinct * from dfv").show()
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+---+
|   a|   b|   c|   d|   e|   f|   g|   h|   i|   j|   k|   l|   m|   n|   o|  p|
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+---+
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  9|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 13|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  2|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  7|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  8|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  3|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  5|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 15|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 12|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 16|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 14|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  4|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  6|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 10|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 11|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  1|
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+---+


scala> spark.sql("select count(distinct *) from dfv").show()
+--------------------------------------------------------------+
|count(DISTINCT a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p)|
+--------------------------------------------------------------+
|                                                            16|
+--------------------------------------------------------------+
Kathleen
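
For anyone who wants to reproduce this locally, here is a minimal sketch (the column
names and values are assumptions, not taken from the thread) that builds a view like
dfv with 15 all-null columns plus one incrementing column, then compares the SQL
count(DISTINCT ...) with a row-level distinct count:

import org.apache.spark.sql.functions._

val base = spark.range(1, 17).toDF("p")
val nullCols = ('a' to 'o').map(c => lit(null).cast("string").as(c.toString))
val allCols = nullCols :+ col("p")                 // 15 null columns plus "p"
val dfv = base.select(allCols: _*)
dfv.createOrReplaceTempView("dfv")

spark.sql("select count(distinct *) from dfv").show()   // SQL count(DISTINCT ...)
println(dfv.distinct().count())                          // DataFrame-level cross-check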

On Fri, Sep 14, 2018 at 11:54 AM Daniele Foroni 
wrote:

> Hi all,
>
> I am having some troubles in doing a count distinct over multiple columns.
> This is an example of my data:
> +----+----+----+---+
> |a   |b   |c   |d  |
> +----+----+----+---+
> |null|null|null|1  |
> |null|null|null|2  |
> |null|null|null|3  |
> |null|null|null|4  |
> |null|null|null|5  |
> |null|null|null|6  |
> |null|null|null|7  |
> +----+----+----+---+
> And my code:
> val df: Dataset[Row] = …
> val cols: List[Column] = df.columns.map(col).toList
> df.agg(countDistinct(cols.head, cols.tail: _*))
>
> So, in the example above, if I count the distinct “rows” I obtain 7 as the
> result, as expected (since the “d” column changes for every row).
> However, with more columns (16) in EXACTLY the same situation (one
> incremental column and 15 columns filled with nulls), the result is 0.
>
> I don’t understand why I am experiencing this problem.
> Any solution?
>
> Thanks,
> ---
> Daniele
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Subscribe Multiple Topics Structured Streaming

2018-09-17 Thread Sivaprakash
I would like to know how to define the stream and sink operations outside the
"main" method, for example in separate classes that I can invoke from main, so
that each topic I subscribe to can have its own implementation in its own class
file. Is that good practice, or should the whole implementation always go inside
"main"?
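
One possible way to factor this (an illustrative sketch, not from the thread; the
class names and console sink are assumptions): give each topic its own handler
object that builds a streaming query from a shared SparkSession, and let main just
wire them together.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

trait TopicPipeline {
  def topic: String
  def transform(df: DataFrame): DataFrame   // per-topic logic lives here

  def start(spark: SparkSession): StreamingQuery = {
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .load()
    transform(raw).writeStream.format("console").start()  // console sink only for illustration
  }
}

object StatusPipeline extends TopicPipeline {
  val topic = "utility-status"
  def transform(df: DataFrame): DataFrame = df.selectExpr("CAST(value AS STRING)")
}

object CriticalPipeline extends TopicPipeline {
  val topic = "utility-critical"
  def transform(df: DataFrame): DataFrame = df.selectExpr("CAST(value AS STRING)")
}

object DeviceAutomation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("DeviceAutomation").getOrCreate()
    Seq(StatusPipeline, CriticalPipeline).foreach(_.start(spark))
    spark.streams.awaitAnyTermination()   // block until any of the queries terminates
  }
}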

On Mon, Sep 17, 2018 at 11:35 PM naresh Goud 
wrote:

> You can have below statement for multiple topics
>
> val dfStatus = spark.readStream.
>   format("kafka").
>   option("subscribe", "utility-status, utility-critical").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   option("startingOffsets", "earliest")
>   .load()
>
>
>
>
>
> On Mon, Sep 17, 2018 at 3:28 AM sivaprakash <
> sivaprakashshanmu...@gmail.com> wrote:
>
>> Hi
>>
>> I have integrated Spark Streaming with Kafka in which Im listening 2
>> topics
>>
>> def main(args: Array[String]): Unit = {
>>
>> val schema = StructType(
>>   List(
>> StructField("gatewayId", StringType, true),
>> StructField("userId", StringType, true)
>>   )
>> )
>>
>> val spark = SparkSession
>>   .builder
>>   .master("local[4]")
>>   .appName("DeviceAutomation")
>>   .getOrCreate()
>>
>> val dfStatus = spark.readStream.
>>   format("kafka").
>>   option("subscribe", "utility-status, utility-critical").
>>   option("kafka.bootstrap.servers", "localhost:9092").
>>   option("startingOffsets", "earliest")
>>   .load()
>>
>>
>>   }
>>
>> Since I have few more topics to be listed and perform different
>> operations I
>> would like to move each topics into separate case class for better
>> clarity.
>> Is it possible?
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
> Thanks,
> Naresh
> www.linkedin.com/in/naresh-dulam
> http://hadoopandspark.blogspot.com/
>
>

-- 
- Prakash.


Spark FlatMapGroupsWithStateFunction throws cannot resolve 'named_struct()' due to data type mismatch 'SerializeFromObject"

2018-09-17 Thread Kuttaiah Robin
Hello,

I am using FlatMapGroupsWithStateFunction in my Spark streaming application:

FlatMapGroupsWithStateFunction idstateUpdateFunction =
    new FlatMapGroupsWithStateFunction() { ... };

The SessionUpdate class runs into trouble once the highlighted code below (marked
with asterisks) is added; it throws the exception shown further down. The same
milestones attribute, with its setter/getter, has been added to SessionInfo (the
input class), and it does not throw an exception there.

public static class SessionUpdate implements Serializable {

    private static final long serialVersionUID = -3858977319192658483L;

    // instanceId field, assigned and returned by the constructor and accessors below
    private String instanceId;

    *private ArrayList milestones = new ArrayList();*

    private Timestamp processingTimeoutTimestamp;

    public SessionUpdate() {
      super();
    }

    public SessionUpdate(String instanceId, *ArrayList milestones*,
        Timestamp processingTimeoutTimestamp) {
      super();
      this.instanceId = instanceId;
      *this.milestones = milestones;*
      this.processingTimeoutTimestamp = processingTimeoutTimestamp;
    }

    public String getInstanceId() {
      return instanceId;
    }

    public void setInstanceId(String instanceId) {
      this.instanceId = instanceId;
    }

    *public ArrayList getMilestones() {*
    *  return milestones;*
    *}*

    *public void setMilestones(ArrayList milestones) {*
    *  this.milestones = milestones;*
    *}*

    public Timestamp getProcessingTimeoutTimestamp() {
      return processingTimeoutTimestamp;
    }

    public void setProcessingTimeoutTimestamp(Timestamp processingTimeoutTimestamp) {
      this.processingTimeoutTimestamp = processingTimeoutTimestamp;
    }
}

Exception:
ERROR  cannot resolve 'named_struct()' due to data type mismatch: input to
function named_struct requires at least one argument;;
'SerializeFromObject [staticinvoke(class
org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
assertnotnull(input[0,
oracle.insight.spark.event_processor.EventProcessor$SessionUpdate,
true]).getInstanceId, true, false) AS instanceId#62,
mapobjects(MapObjects_loopValue2, MapObjects_loopIsNull2, ObjectType(class
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema), if
(isnull(lambdavariable(MapObjects_loopValue2, MapObjects_loopIsNull2,
ObjectType(class
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema), true)))
null else named_struct(), assertnotnull(input[0,
oracle.insight.spark.event_processor.EventProcessor$SessionUpdate,
true]).getMilestones, None) AS milestones#63, staticinvoke(class
org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType,
fromJavaTimestamp, assertnotnull(input[0,
oracle.insight.spark.event_processor.EventProcessor$SessionUpdate,
true]).getProcessingTimeoutTimestamp, true, false) AS
processingTimeoutTimestamp#64]
+- FlatMapGroupsWithState , cast(value#54 as string).toString,
createexternalrow(EventTime#23.toString, InstanceID#24.toString,
Model#25.toString, Milestone#26.toString, Region#27.toString,
SalesOrganization#28.toString, ProductName#29.toString,
ReasonForQuoteReject#30.toString, ReasonforRejectionBy#31.toString,
OpportunityAmount#32.toJavaBigDecimal, Discount#33.toJavaBigDecimal,
TotalQuoteAmount#34.toJavaBigDecimal, NetQuoteAmount#35.toJavaBigDecimal,
ApprovedDiscount#36.toJavaBigDecimal, TotalOrderAmount#37.toJavaBigDecimal,
StructField(EventTime,StringType,true),
StructField(InstanceID,StringType,true),
StructField(Model,StringType,true), StructField(Milestone,StringType,true),
StructField(Region,StringType,true),
StructField(SalesOrganization,StringType,true),
StructField(ProductName,StringType,true),
StructField(ReasonForQuoteReject,StringType,true),
StructField(ReasonforRejectionBy,StringType,true), ... 6 more fields),
[value#54], [EventTime#23, InstanceID#24, Model#25, Milestone#26,
Region#27, SalesOrganization#28, ProductName#29, ReasonForQuoteReject#30,
ReasonforRejectionBy#31, OpportunityAmount#32, Discount#33,
TotalQuoteAmount#34, NetQuoteAmount#35, ApprovedDiscount#36,
TotalOrderAmount#37], obj#61:
oracle.insight.spark.event_processor.EventProcessor$SessionUpdate,
class[instanceId[0]: string, milestones[0]: array>,
processingTimeoutTimestamp[0]: timestamp], Append, false,
ProcessingTimeTimeout


Schema looks like

{"Name":"EventTime","DataType":"TimestampType"},
{"Name":"InstanceID",   "DataType":"STRING",  "Length":100},
{"Name":"Model","DataType":"STRING",  "Length":100},
{"Name":"Milestone","DataType":"STRING",  "Length":100},
{"Name":"Region",   "DataType":"STRING",  "Length":100},
{"Name":"SalesOrganization","DataType":"STRING",  "Length":100},
{"Name":"ProductName",  "DataType":"STRING",  "Length":100},
{"Name":"ReasonForQuoteReject", "DataType":"STRING",  "Length":100},
{"Name":"ReasonforRejectionBy", "DataType":"STRING",  "Length":100},
//Note: org.apache.spark.sql.types.DataTypes.createDecimalType(precision(),
scale())
{"Name":"OpportunityAmount","DataType":"DECIMAL",
"Precision":38,"Scale":2},
{"Name":"Discount", "DataType":"DECIMAL",
"Precision":38,"Scale":2},
{"Name":"TotalQuoteAmount", "DataType":"DECIMAL",
"Precision":38,"Scale":2},

Re: Metastore problem on Spark2.3 with Hive3.0

2018-09-17 Thread Dongjoon Hyun
Hi, Jerry.

There is a JIRA issue for that,
https://issues.apache.org/jira/browse/SPARK-24360 .

So far, it's in progress for Hive 3.1.0 Metastore for Apache Spark 2.5.0.
You can track that issue there.

Bests,
Dongjoon.
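
(Not from the thread, but for context: the usual knobs for pointing Spark at a
different metastore client are spark.sql.hive.metastore.version and
spark.sql.hive.metastore.jars, and Hive 3.x is not an accepted value in Spark 2.3,
which is exactly what SPARK-24360 tracks. The values below are only illustrative.)

# spark-defaults.conf sketch
spark.sql.hive.metastore.version   2.1.1
spark.sql.hive.metastore.jars      /path/to/hive/lib/*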


On Mon, Sep 17, 2018 at 7:01 PM 白也诗无敌 <445484...@qq.com> wrote:

> Hi, guys,
>   I am using Spark 2.3 and I am running into a metastore problem.
>   It looks like a compatibility issue, because Spark 2.3 still uses
> hive-metastore-1.2.1-spark2.
>   Is there any solution?
>   The Hive metastore version is 3.0 and the stack trace is below:
>
> org.apache.thrift.TApplicationException: Required field 'filesAdded' is
> unset! Struct:InsertEventRequestData(filesAdded:null)
> at
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4182)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4169)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:1954)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy5.fireListenerEvent(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.fireInsertEvent(Hive.java:1947)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1673)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:847)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:756)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:829)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:416)
> at
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:403)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
> at org.apache.spark.sql.Dataset.(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:355)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at
> 

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
I think that makes sense. The main benefit of deprecating *prior* to 3.0
would be informational - making the community aware of the upcoming
transition earlier. But there are other ways to start informing the
community between now and 3.0, besides formal deprecation.

I have some residual curiosity about what it might mean for a release like
2.4 to still be in its support lifetime after Py2 goes EOL. I asked Apache
Legal  to comment. It is
possible there are no issues with this at all.


On Mon, Sep 17, 2018 at 4:26 PM, Reynold Xin  wrote:

> i'd like to second that.
>
> if we want to communicate timeline, we can add to the release notes saying
> py2 will be deprecated in 3.0, and removed in a 3.x release.
>
> --
> excuse the brevity and lower case due to wrist injury
>
>
> On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia 
> wrote:
>
>> That’s a good point — I’d say there’s just a risk of creating a
>> perception issue. First, some users might feel that this means they have to
>> migrate now, which is before Python itself drops support; they might also
>> be surprised that we did this in a minor release (e.g. might we drop Python
>> 2 altogether in a Spark 2.5 if that later comes out?). Second, contributors
>> might feel that this means new features no longer have to work with Python
>> 2, which would be confusing. Maybe it’s OK on both fronts, but it just
>> seems scarier for users to do this now if we do plan to have Spark 3.0 in
>> the next 6 months anyway.
>>
>> Matei
>>
>> > On Sep 17, 2018, at 1:04 PM, Mark Hamstra 
>> wrote:
>> >
>> > What is the disadvantage to deprecating now in 2.4.0? I mean, it
>> doesn't change the code at all; it's just a notification that we will
>> eventually cease supporting Py2. Wouldn't users prefer to get that
>> notification sooner rather than later?
>> >
>> > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
>> wrote:
>> > I’d like to understand the maintenance burden of Python 2 before
>> deprecating it. Since it is not EOL yet, it might make sense to only
>> deprecate it once it’s EOL (which is still over a year from now).
>> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
>> Scala versions in the same codebase, so what are we losing out?
>> >
>> > The other thing is that even though Python core devs might not support
>> 2.x later, it’s quite possible that various Linux distros will if moving
>> from 2 to 3 remains painful. In that case, we may want Apache Spark to
>> continue releasing for it despite the Python core devs not supporting it.
>> >
>> > Basically, I’d suggest to deprecate this in Spark 3.0 and then remove
>> it later in 3.x instead of deprecating it in 2.4. I’d also consider looking
>> at what other data science tools are doing before fully removing it: for
>> example, if Pandas and TensorFlow no longer support Python 2 past some
>> point, that might be a good point to remove it.
>> >
>> > Matei
>> >
>> > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
>> wrote:
>> > >
>> > > If we're going to do that, then we need to do it right now, since
>> 2.4.0 is already in release candidates.
>> > >
>> > > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
>> wrote:
>> > > I like Mark’s concept for deprecating Py2 starting with 2.4: It may
>> seem like a ways off but even now there may be some spark versions
>> supporting Py2 past the point where Py2 is no longer receiving security
>> patches
>> > >
>> > >
>> > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>> > > We could also deprecate Py2 already in the 2.4.0 release.
>> > >
>> > > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>> > > In case this didn't make it onto this thread:
>> > >
>> > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>> remove it entirely on a later 3.x release.
>> > >
>> > > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>> wrote:
>> > > On a separate dev@spark thread, I raised a question of whether or
>> not to support python 2 in Apache Spark, going forward into Spark 3.0.
>> > >
>> > > Python-2 is going EOL at the end of 2019. The upcoming release of
>> Spark 3.0 is an opportunity to make breaking changes to Spark's APIs, and
>> so it is a good time to consider support for Python-2 on PySpark.
>> > >
>> > > Key advantages to dropping Python 2 are:
>> > >   • Support for PySpark becomes significantly easier.
>> > >   • Avoid having to support Python 2 until Spark 4.0, which is
>> likely to imply supporting Python 2 for some time after it goes EOL.
>> > > (Note that supporting python 2 after EOL means, among other things,
>> that PySpark would be supporting a version of python that was no longer
>> receiving security patches)
>> > >
>> > > The main disadvantage is that PySpark users who have legacy python-2
>> code would have to migrate their code to python 3 to take advantage of
>> Spark 3.0
>> > >
>> > > This decision obviously has 

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Reynold Xin
i'd like to second that.

if we want to communicate timeline, we can add to the release notes saying
py2 will be deprecated in 3.0, and removed in a 3.x release.

--
excuse the brevity and lower case due to wrist injury


On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia 
wrote:

> That’s a good point — I’d say there’s just a risk of creating a perception
> issue. First, some users might feel that this means they have to migrate
> now, which is before Python itself drops support; they might also be
> surprised that we did this in a minor release (e.g. might we drop Python 2
> altogether in a Spark 2.5 if that later comes out?). Second, contributors
> might feel that this means new features no longer have to work with Python
> 2, which would be confusing. Maybe it’s OK on both fronts, but it just
> seems scarier for users to do this now if we do plan to have Spark 3.0 in
> the next 6 months anyway.
>
> Matei
>
> > On Sep 17, 2018, at 1:04 PM, Mark Hamstra 
> wrote:
> >
> > What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't
> change the code at all; it's just a notification that we will eventually
> cease supporting Py2. Wouldn't users prefer to get that notification sooner
> rather than later?
> >
> > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
> wrote:
> > I’d like to understand the maintenance burden of Python 2 before
> deprecating it. Since it is not EOL yet, it might make sense to only
> deprecate it once it’s EOL (which is still over a year from now).
> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
> Scala versions in the same codebase, so what are we losing out?
> >
> > The other thing is that even though Python core devs might not support
> 2.x later, it’s quite possible that various Linux distros will if moving
> from 2 to 3 remains painful. In that case, we may want Apache Spark to
> continue releasing for it despite the Python core devs not supporting it.
> >
> > Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at
> what other data science tools are doing before fully removing it: for
> example, if Pandas and TensorFlow no longer support Python 2 past some
> point, that might be a good point to remove it.
> >
> > Matei
> >
> > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
> wrote:
> > >
> > > If we're going to do that, then we need to do it right now, since
> 2.4.0 is already in release candidates.
> > >
> > > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
> wrote:
> > > I like Mark’s concept for deprecating Py2 starting with 2.4: It may
> seem like a ways off but even now there may be some spark versions
> supporting Py2 past the point where Py2 is no longer receiving security
> patches
> > >
> > >
> > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
> > > We could also deprecate Py2 already in the 2.4.0 release.
> > >
> > > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
> > > In case this didn't make it onto this thread:
> > >
> > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
> remove it entirely on a later 3.x release.
> > >
> > > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
> > > On a separate dev@spark thread, I raised a question of whether or not
> to support python 2 in Apache Spark, going forward into Spark 3.0.
> > >
> > > Python-2 is going EOL at the end of 2019. The upcoming release of
> Spark 3.0 is an opportunity to make breaking changes to Spark's APIs, and
> so it is a good time to consider support for Python-2 on PySpark.
> > >
> > > Key advantages to dropping Python 2 are:
> > >   • Support for PySpark becomes significantly easier.
> > >   • Avoid having to support Python 2 until Spark 4.0, which is
> likely to imply supporting Python 2 for some time after it goes EOL.
> > > (Note that supporting python 2 after EOL means, among other things,
> that PySpark would be supporting a version of python that was no longer
> receiving security patches)
> > >
> > > The main disadvantage is that PySpark users who have legacy python-2
> code would have to migrate their code to python 3 to take advantage of
> Spark 3.0
> > >
> > > This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
> > >
> > >
> >
>
>


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
That’s a good point — I’d say there’s just a risk of creating a perception 
issue. First, some users might feel that this means they have to migrate now, 
which is before Python itself drops support; they might also be surprised that 
we did this in a minor release (e.g. might we drop Python 2 altogether in a 
Spark 2.5 if that later comes out?). Second, contributors might feel that this 
means new features no longer have to work with Python 2, which would be 
confusing. Maybe it’s OK on both fronts, but it just seems scarier for users to 
do this now if we do plan to have Spark 3.0 in the next 6 months anyway.

Matei

> On Sep 17, 2018, at 1:04 PM, Mark Hamstra  wrote:
> 
> What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't 
> change the code at all; it's just a notification that we will eventually 
> cease supporting Py2. Wouldn't users prefer to get that notification sooner 
> rather than later?
> 
> On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia  
> wrote:
> I’d like to understand the maintenance burden of Python 2 before deprecating 
> it. Since it is not EOL yet, it might make sense to only deprecate it once 
> it’s EOL (which is still over a year from now). Supporting Python 2+3 seems 
> less burdensome than supporting, say, multiple Scala versions in the same 
> codebase, so what are we losing out?
> 
> The other thing is that even though Python core devs might not support 2.x 
> later, it’s quite possible that various Linux distros will if moving from 2 
> to 3 remains painful. In that case, we may want Apache Spark to continue 
> releasing for it despite the Python core devs not supporting it.
> 
> Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it 
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at 
> what other data science tools are doing before fully removing it: for 
> example, if Pandas and TensorFlow no longer support Python 2 past some point, 
> that might be a good point to remove it.
> 
> Matei
> 
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra  wrote:
> > 
> > If we're going to do that, then we need to do it right now, since 2.4.0 is 
> > already in release candidates.
> > 
> > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:
> > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem 
> > like a ways off but even now there may be some spark versions supporting 
> > Py2 past the point where Py2 is no longer receiving security patches 
> > 
> > 
> > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra  
> > wrote:
> > We could also deprecate Py2 already in the 2.4.0 release.
> > 
> > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:
> > In case this didn't make it onto this thread:
> > 
> > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove 
> > it entirely on a later 3.x release.
> > 
> > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson  
> > wrote:
> > On a separate dev@spark thread, I raised a question of whether or not to 
> > support python 2 in Apache Spark, going forward into Spark 3.0.
> > 
> > Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 
> > is an opportunity to make breaking changes to Spark's APIs, and so it is a 
> > good time to consider support for Python-2 on PySpark.
> > 
> > Key advantages to dropping Python 2 are:
> >   • Support for PySpark becomes significantly easier.
> >   • Avoid having to support Python 2 until Spark 4.0, which is likely 
> > to imply supporting Python 2 for some time after it goes EOL.
> > (Note that supporting python 2 after EOL means, among other things, that 
> > PySpark would be supporting a version of python that was no longer 
> > receiving security patches)
> > 
> > The main disadvantage is that PySpark users who have legacy python-2 code 
> > would have to migrate their code to python 3 to take advantage of Spark 3.0
> > 
> > This decision obviously has large implications for the Apache Spark 
> > community and we want to solicit community feedback.
> > 
> > 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
FWIW, Pandas is dropping Py2 support at the end of this year. TensorFlow is less
clear: they only support Py3 on Windows, but there is no reference to any policy
about Py2 on their roadmap or in the TF 2.0 announcement.


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't
change the code at all; it's just a notification that we will eventually
cease supporting Py2. Wouldn't users prefer to get that notification sooner
rather than later?

On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
wrote:

> I’d like to understand the maintenance burden of Python 2 before
> deprecating it. Since it is not EOL yet, it might make sense to only
> deprecate it once it’s EOL (which is still over a year from now).
> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
> Scala versions in the same codebase, so what are we losing out?
>
> The other thing is that even though Python core devs might not support 2.x
> later, it’s quite possible that various Linux distros will if moving from 2
> to 3 remains painful. In that case, we may want Apache Spark to continue
> releasing for it despite the Python core devs not supporting it.
>
> Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at
> what other data science tools are doing before fully removing it: for
> example, if Pandas and TensorFlow no longer support Python 2 past some
> point, that might be a good point to remove it.
>
> Matei
>
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
> wrote:
> >
> > If we're going to do that, then we need to do it right now, since 2.4.0
> is already in release candidates.
> >
> > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
> wrote:
> > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
> like a ways off but even now there may be some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
> >
> >
> > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
> > We could also deprecate Py2 already in the 2.4.0 release.
> >
> > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
> > In case this didn't make it onto this thread:
> >
> > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
> remove it entirely on a later 3.x release.
> >
> > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
> > On a separate dev@spark thread, I raised a question of whether or not
> to support python 2 in Apache Spark, going forward into Spark 3.0.
> >
> > Python-2 is going EOL at the end of 2019. The upcoming release of Spark
> 3.0 is an opportunity to make breaking changes to Spark's APIs, and so it
> is a good time to consider support for Python-2 on PySpark.
> >
> > Key advantages to dropping Python 2 are:
> >   • Support for PySpark becomes significantly easier.
> >   • Avoid having to support Python 2 until Spark 4.0, which is
> likely to imply supporting Python 2 for some time after it goes EOL.
> > (Note that supporting python 2 after EOL means, among other things, that
> PySpark would be supporting a version of python that was no longer
> receiving security patches)
> >
> > The main disadvantage is that PySpark users who have legacy python-2
> code would have to migrate their code to python 3 to take advantage of
> Spark 3.0
> >
> > This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
> >
> >
>
>


why display this error

2018-09-17 Thread hager
I run this code using spark-submit --jars
spark-streaming-kafka-0-8-assembly_2.10-2.0.0-preview.jar kafka2.py
localhost:9092 test


import sys

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from uuid import uuid1

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingRecieverKafkaWordCount")
    ssc = StreamingContext(sc, 2)  # 2 second window

    broker, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc,
                                  broker,
                                  "raw-event-streaming-consumer", {topic: 1})

    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

    counts.pprint()
    ssc.start()
    ssc.awaitTermination()


Why is this error displayed?
2018-09-17 11:49:19 ERROR ReceiverTracker:91 - Receiver has been stopped.
Try to restart it.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 81.0 failed 1 times, most recent failure: Lost task 0.0 in stage 81.0
(TID 81, localhost, executor driver): java.lang.AbstractMethodError
at
org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
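
Not an answer from the thread, but java.lang.AbstractMethodError in this setup is
commonly a Scala/Spark version mismatch between the assembly jar (built for Scala
2.10 and Spark 2.0.0-preview) and the Spark installation running the job. A sketch
of aligning them, where the exact versions are assumptions:

spark-submit \
  --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar \
  kafka2.py localhost:9092 test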



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
I’d like to understand the maintenance burden of Python 2 before deprecating 
it. Since it is not EOL yet, it might make sense to only deprecate it once it’s 
EOL (which is still over a year from now). Supporting Python 2+3 seems less 
burdensome than supporting, say, multiple Scala versions in the same codebase, 
so what are we losing out?

The other thing is that even though Python core devs might not support 2.x 
later, it’s quite possible that various Linux distros will if moving from 2 to 
3 remains painful. In that case, we may want Apache Spark to continue releasing 
for it despite the Python core devs not supporting it.

Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it later 
in 3.x instead of deprecating it in 2.4. I’d also consider looking at what 
other data science tools are doing before fully removing it: for example, if 
Pandas and TensorFlow no longer support Python 2 past some point, that might be 
a good point to remove it.

Matei

> On Sep 17, 2018, at 11:01 AM, Mark Hamstra  wrote:
> 
> If we're going to do that, then we need to do it right now, since 2.4.0 is 
> already in release candidates.
> 
> On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:
> I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem like 
> a ways off but even now there may be some spark versions supporting Py2 past 
> the point where Py2 is no longer receiving security patches 
> 
> 
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra  wrote:
> We could also deprecate Py2 already in the 2.4.0 release.
> 
> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:
> In case this didn't make it onto this thread:
> 
> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it 
> entirely on a later 3.x release.
> 
> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson  wrote:
> On a separate dev@spark thread, I raised a question of whether or not to 
> support python 2 in Apache Spark, going forward into Spark 3.0.
> 
> Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 
> is an opportunity to make breaking changes to Spark's APIs, and so it is a 
> good time to consider support for Python-2 on PySpark.
> 
> Key advantages to dropping Python 2 are:
>   • Support for PySpark becomes significantly easier.
>   • Avoid having to support Python 2 until Spark 4.0, which is likely to 
> imply supporting Python 2 for some time after it goes EOL.
> (Note that supporting python 2 after EOL means, among other things, that 
> PySpark would be supporting a version of python that was no longer receiving 
> security patches)
> 
> The main disadvantage is that PySpark users who have legacy python-2 code 
> would have to migrate their code to python 3 to take advantage of Spark 3.0
> 
> This decision obviously has large implications for the Apache Spark community 
> and we want to solicit community feedback.
> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Subscribe Multiple Topics Structured Streaming

2018-09-17 Thread naresh Goud
You can have below statement for multiple topics

val dfStatus = spark.readStream.
  format("kafka").
  option("subscribe", "utility-status, utility-critical").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("startingOffsets", "earliest")
  .load()





On Mon, Sep 17, 2018 at 3:28 AM sivaprakash 
wrote:

> Hi
>
> I have integrated Spark Streaming with Kafka in which Im listening 2 topics
>
> def main(args: Array[String]): Unit = {
>
> val schema = StructType(
>   List(
> StructField("gatewayId", StringType, true),
> StructField("userId", StringType, true)
>   )
> )
>
> val spark = SparkSession
>   .builder
>   .master("local[4]")
>   .appName("DeviceAutomation")
>   .getOrCreate()
>
> val dfStatus = spark.readStream.
>   format("kafka").
>   option("subscribe", "utility-status, utility-critical").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   option("startingOffsets", "earliest")
>   .load()
>
>
>   }
>
> Since I have few more topics to be listed and perform different operations
> I
> would like to move each topics into separate case class for better clarity.
> Is it possible?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Thanks,
Naresh
www.linkedin.com/in/naresh-dulam
http://hadoopandspark.blogspot.com/


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
If we're going to do that, then we need to do it right now, since 2.4.0 is
already in release candidates.

On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:

> I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
> like a ways off but even now there may be some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
>
>
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
>
>> We could also deprecate Py2 already in the 2.4.0 release.
>>
>> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>>
>>> In case this didn't make it onto this thread:
>>>
>>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>>> remove it entirely on a later 3.x release.
>>>
>>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>>> wrote:
>>>
 On a separate dev@spark thread, I raised a question of whether or not
 to support python 2 in Apache Spark, going forward into Spark 3.0.

 Python-2 is going EOL  at
 the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
 make breaking changes to Spark's APIs, and so it is a good time to consider
 support for Python-2 on PySpark.

 Key advantages to dropping Python 2 are:

- Support for PySpark becomes significantly easier.
- Avoid having to support Python 2 until Spark 4.0, which is likely
to imply supporting Python 2 for some time after it goes EOL.

 (Note that supporting python 2 after EOL means, among other things,
 that PySpark would be supporting a version of python that was no longer
 receiving security patches)

 The main disadvantage is that PySpark users who have legacy python-2
 code would have to migrate their code to python 3 to take advantage of
 Spark 3.0

 This decision obviously has large implications for the Apache Spark
 community and we want to solicit community feedback.


>>>


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
like a ways off but even now there may be some spark versions supporting
Py2 past the point where Py2 is no longer receiving security patches


On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
wrote:

> We could also deprecate Py2 already in the 2.4.0 release.
>
> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
>
>> In case this didn't make it onto this thread:
>>
>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>> remove it entirely on a later 3.x release.
>>
>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>> wrote:
>>
>>> On a separate dev@spark thread, I raised a question of whether or not
>>> to support python 2 in Apache Spark, going forward into Spark 3.0.
>>>
>>> Python-2 is going EOL  at
>>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>>> make breaking changes to Spark's APIs, and so it is a good time to consider
>>> support for Python-2 on PySpark.
>>>
>>> Key advantages to dropping Python 2 are:
>>>
>>>- Support for PySpark becomes significantly easier.
>>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>>to imply supporting Python 2 for some time after it goes EOL.
>>>
>>> (Note that supporting python 2 after EOL means, among other things, that
>>> PySpark would be supporting a version of python that was no longer
>>> receiving security patches)
>>>
>>> The main disadvantage is that PySpark users who have legacy python-2
>>> code would have to migrate their code to python 3 to take advantage of
>>> Spark 3.0
>>>
>>> This decision obviously has large implications for the Apache Spark
>>> community and we want to solicit community feedback.
>>>
>>>
>>


Subscribe Multiple Topics Structured Streaming

2018-09-17 Thread sivaprakash
Hi

I have integrated Spark Streaming with Kafka in which Im listening 2 topics

def main(args: Array[String]): Unit = {

val schema = StructType(
  List(
StructField("gatewayId", StringType, true),
StructField("userId", StringType, true)
  )
)

val spark = SparkSession
  .builder
  .master("local[4]")
  .appName("DeviceAutomation")
  .getOrCreate()

val dfStatus = spark.readStream.
  format("kafka").
  option("subscribe", "utility-status, utility-critical").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("startingOffsets", "earliest")
  .load()

  
  }
  
Since I have a few more topics to listen to, each with different operations to
perform, I would like to move each topic into a separate case class for better
clarity. Is it possible?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: What is the best way for Spark to read HDF5@scale?

2018-09-17 Thread Saurav Sinha
Here is the solution

sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")

How did I find out nn1home:8020?

Just search for the core-site.xml file and look for the fs.defaultFS XML element.
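
For reference, an illustrative core-site.xml excerpt (host and port matching the
example above):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nn1home:8020</value>
</property>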

On Fri, Sep 14, 2018 at 7:57 PM kathleen li  wrote:

> Hi,
> Any Spark-connector for HDF5?
>
> The following link does not work anymore?
>
> https://www.hdfgroup.org/downloads/spark-connector/
> down vo
>
> Thanks,
>
> Kathleen
>


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Best practices on how to multiple spark sessions

2018-09-17 Thread Venkat Ramakrishnan
Umesh,

I found the following write-up, which covers the architecture and memory
considerations in detail. The memory model has seen some updates since it was
written, but it should be a good starting point for you:
https://0x0fff.com/spark-architecture/

Any additional source(s) of info. are welcome from others too.

- Venkat.
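
Not from the thread, but a small sketch of the trade-off being discussed: a single
shared session created once, versus per-job sessions obtained with newSession(),
which isolate SQL configuration and temporary views while reusing the same
SparkContext, so they avoid paying the full startup cost once per ETL job.

import org.apache.spark.sql.SparkSession

// One session (and one SparkContext) created up front and shared by the scheduler.
val shared = SparkSession.builder.appName("etl").getOrCreate()

// Each ETL job can run in its own logical session: temp views and SQL conf are
// isolated per session, but the underlying SparkContext (and its memory) is shared.
val jobSession = shared.newSession()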

On Sun, Sep 16, 2018 at 11:45 PM unk1102  wrote:

> Hi, I have an application that serves as an ETL job, and I have hundreds of such
> ETL jobs that run daily. As of now I have just one Spark session that is shared
> by all these jobs, and sometimes all of these jobs run at the same time, causing
> the Spark session to die, mostly due to memory issues. Is this a good design? I
> am thinking of creating multiple Spark sessions, possibly one Spark session for
> each ETL job, but the delay in starting a Spark session seems to multiply by the
> number of ETL jobs. Please share best practices and designs for such problems.
> Thanks in advance.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>