[Announcement] Analytics Zoo 0.5 release

2019-06-17 Thread Jason Dai
Hi all,


We are happy to announce the 0.5 release of Analytics Zoo, a unified
Analytics + AI platform for *distributed TensorFlow, Keras & BigDL on Apache
Spark*; some of the notable new features in this release are:


   - tfpark: distributed TensorFlow for Apache Spark
      - Support Keras and TensorFlow Estimator APIs
      - Built-in NLP models (including NER, Intent Extraction and POS Tagging)
      - Premade BERT estimators (e.g., BERT Classifier)


   - Model Inference improvement, including
      - Int8 (DL Boost/VNNI) inference support based on BigDL and OpenVINO
      - Distributed, streaming inference using Spark Streaming and Apache Flink


   - Additional built-in models for BigDL, including a fine-tuning pipeline
     for SSD, Transformer, BERT, etc.


   - Support for BigDL 0.8.0; please see the download page for the
     supported versions and releases.

 For more details, you may refer to the project website at
https://github.com/intel-analytics/analytics-zoo/



Thanks,

-Jason


Re: A basic question

2019-06-17 Thread Shyam P
Thank you so much, Deepak.
Let me implement it and update you. Hope it works.

Are there any shortcomings I need to consider or take care of?

Regards,
Shyam

On Mon, Jun 17, 2019 at 12:39 PM Deepak Sharma 
wrote:

> You can follow this example:
>
> https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html
>
>
> On Mon, Jun 17, 2019 at 12:27 PM Shyam P  wrote:
>
>> I am developing a Spark job using Java 1.8.
>>
>> Is it possible to write a Spark app using Spring Boot?
>> Has anyone tried it? If so, how should it be done?
>>
>>
>> Regards,
>> Shyam
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
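
Following up on the shortcomings question, here is a minimal sketch of what
Deepak's suggestion can look like in practice: a Spring Boot runner that owns
the SparkSession lifecycle. This is not taken from the linked spring-hadoop
guide; the class names, the local master, and the toy job are illustrative
assumptions, and on a real cluster you would still package the jar and launch
it with spark-submit. The main shortcoming to watch for is dependency
conflicts between Spring Boot's fat-jar packaging and the jars already shipped
with the Spark distribution. The same structure applies in Java 1.8; the
sketch below is in Scala.

import org.apache.spark.sql.SparkSession
import org.springframework.boot.{ApplicationArguments, ApplicationRunner, SpringApplication}
import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.stereotype.Component

@SpringBootApplication
class SparkJobApplication

@Component
class SparkJobRunner extends ApplicationRunner {
  // Runs once at startup; the Spark job lives inside the Spring context.
  override def run(args: ApplicationArguments): Unit = {
    val spark = SparkSession.builder()
      .appName("spring-boot-spark-sketch")
      .master("local[*]") // assumption: local run; on a cluster let spark-submit set the master
      .getOrCreate()
    try {
      spark.range(1, 11).selectExpr("sum(id) AS total").show() // placeholder job
    } finally {
      spark.stop()
    }
  }
}

object SparkJobApplication {
  def main(args: Array[String]): Unit =
    SpringApplication.run(classOf[SparkJobApplication], args: _*)
}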


Re: Filter cannot be pushed via a Join

2019-06-17 Thread Gourav Sengupta
Hi,
Can I ask which version of Spark you are using, and in what
environment?

Regards,
Gourav

On Fri, Jun 14, 2019 at 5:14 PM William Wong  wrote:

> Dear all,
>
> I created two tables.
>
> scala> spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string)
> USING PARQUET");
> 19/06/14 23:49:10 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 19/06/14 23:49:11 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> res1: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string)
> USING PARQUET");
> res2: org.apache.spark.sql.DataFrame = []
>
>
> Here is the plan of joining these two tables via the ID column. It looks good
> to me, as the filter 'id = 'a'' is pushed to both tables as expected.
>
> scala> spark.sql("SELECT * FROM table2 t1, table2 t2 WHERE t1.id = t2.id
> AND t1.id ='a'").explain
> == Physical Plan ==
> *(2) BroadcastHashJoin [id#23], [id#68], Inner, BuildRight
> :- *(2) Project [id#23, val#24]
> :  +- *(2) Filter (isnotnull(id#23) && (id#23 = a))
> : +- *(2) FileScan parquet default.table2[id#23,val#24] Batched: true,
> Format: Parquet, Location:
> InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table2], 
> *PartitionFilters:
> [], PushedFilters: [IsNotNull(id), EqualTo(id,a)],* ReadSchema:
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string,
> true]))
>+- *(1) Project [id#68, val#69]
>   +- *(1) Filter ((id#68 = a) && isnotnull(id#68))
>  +- *(1) FileScan parquet default.table2[id#68,val#69] Batched:
> true, Format: Parquet, Location:
> InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table2], 
> *PartitionFilters:
> [], PushedFilters: [EqualTo(id,a), IsNotNull(id)],* ReadSchema:
> struct
>
>
> Somehow, we created a view on table1 by unioning a few partitions like this:
>
> scala> spark.sql("""
>  | CREATE VIEW partitioned_table_1 AS
>  | SELECT * FROM table1 WHERE id = 'a'
>  | UNION ALL
>  | SELECT * FROM table1 WHERE id = 'b'
>  | UNION ALL
>  | SELECT * FROM table1 WHERE id = 'c'
>  | UNION ALL
>  | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
>  | """.stripMargin)
> res7: org.apache.spark.sql.DataFrame = []
>
>
> In theory, selecting data via this view 'partitioned_table_1' should be
> the same as via the table 'table1'.
>
> This query can also push the filter 'id IN ('a','b','c','d')' to table2 as
> expected.
>
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE
> t1.id = t2.id AND t1.id IN ('a','b','c','d')").explain
> == Physical Plan ==
> *(6) BroadcastHashJoin [id#0], [id#23], Inner, BuildRight
> :- Union
> :  :- *(1) Project [id#0, val#1]
> :  :  +- *(1) Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
> :  : +- *(1) FileScan parquet default.table1[id#0,val#1] Batched:
> true, Format: Parquet, Location:
> InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table1],
> PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,a), In(id,
> [a,b,c,d])], ReadSchema: struct
> :  :- *(2) Project [id#0, val#1]
> :  :  +- *(2) Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
> :  : +- *(2) FileScan parquet default.table1[id#0,val#1] Batched:
> true, Format: Parquet, Location:
> InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table1],
> PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,b), In(id,
> [a,b,c,d])], ReadSchema: struct
> :  :- *(3) Project [id#0, val#1]
> :  :  +- *(3) Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
> :  : +- *(3) FileScan parquet default.table1[id#0,val#1] Batched:
> true, Format: Parquet, Location:
> InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table1],
> PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,c), In(id,
> [a,b,c,d])], ReadSchema: struct
> :  +- *(4) Project [id#0, val#1]
> : +- *(4) Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) &&
> isnotnull(id#0))
> :+- *(4) FileScan parquet default.table1[id#0,val#1] Batched:
> true, Format: Parquet, Location:
> InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table1],
> PartitionFilters: [], PushedFilters: [Not(In(id, [a,b,c])), In(id,
> [a,b,c,d]), IsNotNull(id)], ReadSchema: struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string,
> true]))
>+- *(5) Project [id#23, val#24]
>   +- *(5) Filter ((id#23 IN (a,b,c,d) && ((isnotnull(id#23) &&
> (((id#23 = a) || (id#23 = b)) || (id#23 = c))) || NOT id#23 IN (a,b,c))) &&
> isnotnull(id#23))
>  +- *(5) FileScan parquet default.table2[id#23,val#24] Batched:
> true, Format: Parquet, Location:
> InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table2],
> PartitionFilters: [], *PushedFilters: [In(id, [a,b,c,d]),
> 
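
For this class of problem, one workaround worth trying (a sketch, not a
confirmed fix for this optimizer behaviour) is to state the selective
predicate on both join inputs explicitly, rather than relying on the optimizer
to infer the filter on t2 through the UNION ALL view; whether it helps here
depends on the plan your Spark version produces. Table and column names follow
William's example above.

// With the IN list restated on t2, the simple In(id, [a,b,c,d]) predicate is
// available for pushdown on table2's scan without being derived from the view.
spark.sql("""
  SELECT *
  FROM partitioned_table_1 t1, table2 t2
  WHERE t1.id = t2.id
    AND t1.id IN ('a','b','c','d')
    AND t2.id IN ('a','b','c','d')
""").explain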

Re: Exposing JIRA issue types at GitHub PRs

2019-06-17 Thread Gabor Somogyi
Dongjoon, I think it's useful. Thanks for adding it!

On Mon, Jun 17, 2019 at 8:05 AM Dongjoon Hyun 
wrote:

> Thank you, Hyukjin !
>
> On Sun, Jun 16, 2019 at 4:12 PM Hyukjin Kwon  wrote:
>
>> Labels look good and useful.
>>
>> On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun, 
>> wrote:
>>
>>> Now, you can see the exposed component labels (ordered by the number of
>>> PRs) here and click the component to search.
>>>
>>> https://github.com/apache/spark/labels?sort=count-desc
>>>
>>> Dongjoon.
>>>
>>>
>>> On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 The JIRA and PR are ready for review.

 https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
 component types at GitHub PRs)
 https://github.com/apache/spark/pull/24871

 Bests,
 Dongjoon.


 On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
 wrote:

> Thank you for the feedback and requirements, Hyukjin, Reynold, and Marco.
>
> Sure, we can do whatever we want.
>
> I'll wait for more feedback and then proceed to the next steps.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
> wrote:
>
>> Hi Dongjoon,
>> Thanks for the proposal! I like the idea. Maybe we can extend it to
>> components too, and to some JIRA labels such as correctness, which may be
>> worth highlighting in PRs as well. My only concern is that in many cases
>> JIRAs are not created very carefully, so they may be incorrect at the
>> moment of PR creation and be updated later: keeping them in sync may be
>> an extra effort.
>>
>> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>>
>>> Seems like a good idea. Can we test this with a component first?
>>>
>>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 Hi, All.

 Since we use both Apache JIRA and GitHub actively for Apache Spark
 contributions, we have lots of JIRAs and PRs consequently. One specific
 thing I've been longing to see is `Jira Issue Type` in GitHub.

 How about exposing JIRA issue types at GitHub PRs as GitHub
 `Labels`? There are two main benefits:
 1. It helps the communication between the contributors and
 reviewers with more information.
 (In some cases, some people only visit GitHub to see the PR and
 commits)
 2. `Labels` are searchable. We don't need to visit Apache JIRA to
 search for PRs of a specific type.
 (For example, the reviewers can see and review 'BUG' PRs first
 by using `is:open is:pr label:BUG`.)

 Of course, this can be done automatically without human
 intervention. Since we already have GitHub Jenkins job to access
 JIRA/GitHub, that job can add the labels from the beginning. If 
 needed, I
 can volunteer to update the script.

 To show the demo, I labeled several PRs manually. You can see the
 result right now in Apache Spark PR page.

   - https://github.com/apache/spark/pulls

 If you're surprised due to those manual activities, I want to
 apologize for that. I hope we can take advantage of the existing GitHub
 features to serve Apache Spark community in a way better than 
 yesterday.

 What do you think about this specific suggestion?

 Bests,
 Dongjoon

 PS. I saw that the `Request Review` and `Assign` features are already
 used for some purposes, but these features are out of scope for this
 email.

>>>


Re: Spark read csv option - capture exception in a column in permissive mode

2019-06-17 Thread Anselmi Rodriguez, Agustina, Vodafone UK
You can sort of hack this by reading it as an RDD[String] and implementing a
custom parser, i.e.

import scala.util.{Failure, Success}
import org.apache.spark.sql.Row

// parse(...) is your own column-splitting logic returning a Try of the values,
// and nullList is a Seq of nulls matching the number of data columns.
val rddRows = rdd.map(parseMyCols)

def parseMyCols(rawVal: String): Row = {
  parse(rawVal) match {
    // parsed fine: append an empty error column
    case Success(parsedRowValues) => Row(parsedRowValues :+ "": _*)
    // failed: nulls for the data columns plus the exception message
    case Failure(exception) => Row(nullList :+ exception.getMessage: _*)
  }
}

Hope this helps
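
For completeness: the built-in PERMISSIVE mode can already keep the raw
malformed line (though not the exception message) if the corrupt-record
column is declared in the schema, as Gourav notes below. A minimal sketch,
with made-up data columns and file path:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The corrupt-record column must appear in the schema, otherwise it is dropped.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("val", StringType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .option("header", "true")
  .csv("/path/to/input.csv") // hypothetical path

// Rows that failed to parse carry the raw line in _corrupt_record, nulls elsewhere.
df.filter("_corrupt_record IS NOT NULL").show(false)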

On 17 Jun 2019, at 06:31, Ajay Thompson  wrote:

There's a column which captures the corrupted record. However, the exception 
isn't captured. If the exception were captured in another column, it would be 
very useful.

On Mon, 17 Jun, 2019, 10:56 AM Gourav Sengupta  wrote:
Hi,

It already does, I think; you just have to add the column to the schema that 
you are using to read.

Regards,
Gourav

On Sun, Jun 16, 2019 at 2:48 PM ajay.thomp...@thedatateam.in wrote:
Hi Team,

Can we have another column which gives the reason the record was corrupted, 
in permissive mode, while reading CSV?

Thanks,
Ajay



Re: A basic question

2019-06-17 Thread Deepak Sharma
You can follow this example:
https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html


On Mon, Jun 17, 2019 at 12:27 PM Shyam P  wrote:

> I am developing a Spark job using Java 1.8.
>
> Is it possible to write a Spark app using Spring Boot?
> Has anyone tried it? If so, how should it be done?
>
>
> Regards,
> Shyam
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


A basic question

2019-06-17 Thread Shyam P
I am developing a Spark job using Java 1.8.

Is it possible to write a Spark app using Spring Boot?
Has anyone tried it? If so, how should it be done?


Regards,
Shyam


Re: Exposing JIRA issue types at GitHub PRs

2019-06-17 Thread Dongjoon Hyun
Thank you, Hyukjin !

On Sun, Jun 16, 2019 at 4:12 PM Hyukjin Kwon  wrote:

> Labels look good and useful.
>
> On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun,  wrote:
>
>> Now, you can see the exposed component labels (ordered by the number of
>> PRs) here and click the component to search.
>>
>> https://github.com/apache/spark/labels?sort=count-desc
>>
>> Dongjoon.
>>
>>
>> On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> The JIRA and PR are ready for review.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
>>> component types at GitHub PRs)
>>> https://github.com/apache/spark/pull/24871
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you for the feedback and requirements, Hyukjin, Reynold, and Marco.

 Sure, we can do whatever we want.

 I'll wait for more feedback and then proceed to the next steps.

 Bests,
 Dongjoon.


 On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
 wrote:

> Hi Dongjoon,
> Thanks for the proposal! I like the idea. Maybe we can extend it to
> components too, and to some JIRA labels such as correctness, which may be
> worth highlighting in PRs as well. My only concern is that in many cases
> JIRAs are not created very carefully, so they may be incorrect at the
> moment of PR creation and be updated later: keeping them in sync may be
> an extra effort.
>
> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>
>> Seems like a good idea. Can we test this with a component first?
>>
>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>>> contributions, we have lots of JIRAs and PRs consequently. One specific
>>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>>
>>> How about exposing JIRA issue types at GitHub PRs as GitHub
>>> `Labels`? There are two main benefits:
>>> 1. It helps the communication between the contributors and reviewers
>>> with more information.
>>> (In some cases, some people only visit GitHub to see the PR and
>>> commits)
>>> 2. `Labels` are searchable. We don't need to visit Apache JIRA to
>>> search for PRs of a specific type.
>>> (For example, the reviewers can see and review 'BUG' PRs first
>>> by using `is:open is:pr label:BUG`.)
>>>
>>> Of course, this can be done automatically without human
>>> intervention. Since we already have GitHub Jenkins job to access
>>> JIRA/GitHub, that job can add the labels from the beginning. If needed, 
>>> I
>>> can volunteer to update the script.
>>>
>>> To show the demo, I labeled several PRs manually. You can see the
>>> result right now in Apache Spark PR page.
>>>
>>>   - https://github.com/apache/spark/pulls
>>>
>>> If you're surprised due to those manual activities, I want to
>>> apologize for that. I hope we can take advantage of the existing GitHub
>>> features to serve Apache Spark community in a way better than yesterday.
>>>
>>> What do you think about this specific suggestion?
>>>
>>> Bests,
>>> Dongjoon
>>>
>>> PS. I saw that the `Request Review` and `Assign` features are already
>>> used for some purposes, but these features are out of scope for this
>>> email.
>>>
>>