Re: spark session jdbc performance

2017-10-24 Thread Gourav Sengupta
Hi Naveen,

I do not think that it is prudent to use the PK as the partitionColumn.
That is too many partitions for any system to handle. numPartitions
behaves quite differently in the JDBC case.
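
For reference, a minimal sketch of how these options are typically combined
(the table name and bounds below are illustrative, not taken from this thread):
Spark turns partitionColumn/lowerBound/upperBound/numPartitions into one range
predicate per partition, so the bounds should roughly match min(id) and max(id)
in the table rather than the desired partition count.

// Sketch of a partitioned JDBC read. Spark generates numPartitions range
// predicates on partitionColumn, e.g. "id < 1000000",
// "id >= 1000000 AND id < 2000000", ..., "id >= 29000000".
// Table name and bounds are assumptions for the example.
val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("user", user)
  .option("password", pwd)
  .option("dbtable", "transactions")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")          // ~min(id) in the table, not a partition index
  .option("upperBound", "30000000")   // ~max(id) in the table
  .option("numPartitions", 30)
  .load()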

Please keep me updated on how things go.


Regards,
Gourav Sengupta

On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire 
wrote:

>
> Hi,
>
>
>
> I am trying to fetch data from Oracle DB using a subquery and experiencing
> a lot of performance issues.
>
> Below is the query I am using,
>
> Using Spark 2.0.2
>
> val df = spark_session.read.format("jdbc")
>   .option("driver", "oracle.jdbc.OracleDriver")
>   .option("url", jdbc_url)
>   .option("user", user)
>   .option("password", pwd)
>   .option("dbtable", "subquery")
>   .option("partitionColumn", "id")  // primary key column, uniformly distributed
>   .option("lowerBound", "1")
>   .option("upperBound", "50")
>   .option("numPartitions", 30)
>   .load()
>
> The above query is supposed to run with 30 partitions, but when I look at the UI
> it is only using 1 partition to run the query.
>
> Can anyone tell me if I am missing anything, or if I need to do anything else to
> tune the performance of the query?
>
> Thanks
>


Using Spark 2.2.0 SparkSession extensions to optimize file filtering

2017-10-24 Thread Chris Luby
I have an external catalog that has additional information on my Parquet files 
that I want to match up with the parsed filters from the plan to prune the list 
of files included in the scan.  I’m looking at doing this using the Spark 2.2.0 
SparkSession extensions similar to the built in partition pruning:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala

and this other project that is along the lines of what I want:

https://github.com/lightcopy/parquet-index/blob/master/src/main/scala/org/apache/spark/sql/execution/datasources/IndexSourceStrategy.scala

but it isn’t caught up to 2.2.0. I’m struggling to understand what type of 
extension I would use to do something like the above:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.SparkSessionExtensions

and if this is the appropriate strategy for this.

Are there any examples out there for using the new extension hooks to alter the 
files included in the plan?

Thanks.
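
A minimal sketch of one way to wire this up with the 2.2.0 hooks, purely as a
starting point (the rule body and the external catalog lookup are hypothetical
placeholders; only SparkSessionExtensions, withExtensions and
injectOptimizerRule are the actual Spark 2.2.0 API):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that would consult an external catalog and rewrite
// file-based relations so their file listing only contains files that can match.
case class PruneFilesWithExternalCatalog(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // match Filter over a file-based relation here and swap in a relation
    // backed by a pruned file listing; left as a no-op placeholder.
    case other => other
  }
}

val spark = SparkSession.builder()
  .withExtensions { extensions =>
    extensions.injectOptimizerRule(session => PruneFilesWithExternalCatalog(session))
  }
  .getOrCreate()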


Re: Is Spark suited for this use case?

2017-10-24 Thread Gourav Sengupta
Hi Saravanan,

SPARK may be free, but making it run with the same level of performance,
consistency, and reliability will show you that SPARK, HADOOP, or anything
else is essentially not free. With Informatica you pay for the licensing
and have almost no headaches as far as stability, upgrades, and reliability
are concerned.

If you want to deliver the same with SPARK, then the costs will start
escalating as you will have to go with SPARK vendors.

As with everything else, my best analogy is: don't use a fork for drinking
soup. SPARK works wonderfully with huge scales of data, but SPARK cannot read
Oracle binary log files or provide a change data capture capability. For your
use case, with such low volumes and with solutions like Informatica and
Golden Gate in place, I think that you are already using optimal solutions.

Of course I am presuming that you DO NOT replicate your entire 6TB of MDM
platform every day and just use CDC to transfer data to your data center for
reporting purposes.

In case you are interested in super fast reporting using AWS Redshift then
please do let me know, I have delivered several end-to-end hybrid data
warehouse solutions and will be happy to help you with the same.


Regards,
Gourav Sengupta

On Mon, Oct 16, 2017 at 5:32 AM, van den Heever, Christian CC <
christian.vandenhee...@standardbank.co.za> wrote:

> Hi,
>
>
>
> We basically have the same scenario, but worldwide; as we have bigger
> datasets we use OGG -> local -> Sqoop into Hadoop.
>
> By all means you can have Spark reading the Oracle tables and then make the
> changes to the data that you need, which would not be done in the Sqoop query,
> i.e. fraudulent detection on transaction records.
>
>
>
> But sometimes the simplest way is the best. Unless you need a change or
> need more, I would advise not using another hop.
>
> I would rather move away from files, as OGG can do files and direct table
> loading, and then use Sqoop for the rest.
>
>
>
> Simpler is better.
>
>
>
> Hope this helps.
>
> C.
>
>
>
> *From:* Saravanan Thirumalai [mailto:saravanan.thiruma...@gmail.com]
> *Sent:* Monday, 16 October 2017 4:29 AM
> *To:* user@spark.apache.org
> *Subject:* Is Spark suited for this use case?
>
>
>
> We are an investment firm and have an MDM platform in Oracle at a vendor
> location and use Oracle Golden Gate to replicate data to our data center for
> reporting needs.
>
> Our data is not big data (total size 6 TB including 2 TB of archive data).
> Moreover, our data doesn't get updated often: once nightly (around 50 MB)
> plus some correction transactions during the day (<10 MB). We don't have
> external users, and hence the data doesn't grow in real time as it would for
> e-commerce.
>
>
>
> When we replicate data from source to target, we transfer data through
> files. So, if there are DML operations (corrections) during day time on a
> source table, the corresponding file would have probably 100 lines of table
> data that needs to be loaded into the target database. Due to low volume of
> data we designed this through Informatica and this works in less than 2-5
> minutes. Can Spark be used in this case or would it be an overkill of
> technology use?
>
>
>
>
>
>
>
>


Re: spark session jdbc performance

2017-10-24 Thread Srinivasa Reddy Tatiredidgari
Hi, is the subquery a user-defined SQL statement or a table name in the DB? If
it is a user-defined SQL, make sure your partition column is in the main SELECT
clause (see the sketch below).
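
To make that concrete, a sketch of passing a user-defined SQL as the dbtable
option (the query text and names are placeholders; the key points are the
surrounding parentheses, the alias, and the partition column appearing in the
outer SELECT list):

// Sketch: using a subquery as the JDBC source. The query must be wrapped in
// parentheses and given an alias, and "id" must be selected so Spark can
// append its per-partition range predicates on it.
val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("dbtable", "(SELECT id, col1, col2 FROM some_table WHERE col3 = 'X') sub")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "50000000")
  .option("numPartitions", 30)
  .load()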

Sent from Yahoo Mail on Android 
 
  On Wed, Oct 25, 2017 at 3:25, Naveen Madhire wrote:   

Hi,

 

I am trying to fetch data from Oracle DB using a subquery and experiencing a lot 
of performance issues.

 

Below is the query I am using,

 

Using Spark 2.0.2

 

val df = spark_session.read.format("jdbc")
.option("driver","oracle.jdbc.OracleDriver")
.option("url", jdbc_url)
   .option("user", user)
   .option("password", pwd)
   .option("dbtable", "subquery")
   .option("partitionColumn", "id")  //primary key column uniformly distributed
   .option("lowerBound", "1")
   .option("upperBound", "50")
.option("numPartitions", 30)
.load()

 

The above query is supposed to run with 30 partitions, but when I look at the UI it is 
only using 1 partition to run the query.

 

Can anyone tell me if I am missing anything, or if I need to do anything else to tune 
the performance of the query?

 Thanks
  


Re: Orc predicate pushdown with Spark Sql

2017-10-24 Thread Jörn Franke
Well the meta information is in the file so I am not surprised that it reads 
the file, but it should not read all the content, which is probably also not 
happening. 
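
For context, a hedged sketch of the two settings usually discussed in this
situation (both are standard Spark SQL conf keys, but whether
convertMetastoreOrc helps depends on the Spark/Hive versions in use, so treat
this as a starting point rather than a guaranteed fix): ORC filter pushdown
applies to the file-based ORC relation, so a query that goes through
HiveTableScan bypasses it unless the metastore ORC table is converted to the
data source path.

// Sketch: enable ORC predicate pushdown and optionally convert metastore ORC
// tables to Spark's file-based ORC reader so pushdown can apply when querying
// the Hive table directly.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")

// With pushdown active, the physical plan should show PushedFilters on an ORC
// relation instead of a plain HiveTableScan.
sqlContext.sql(
  "SELECT id FROM hlogsv5 " +
  "WHERE cdt = 20171003 AND usrpartkey = 'hhhUsers' AND usr = 'AA0YP' " +
  "ORDER BY id DESC LIMIT 10"
).explain()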

> On 24. Oct 2017, at 18:16, Siva Gudavalli  
> wrote:
> 
> 
> Hello,
>  
> I have an update here. 
>  
> Spark SQL is pushing predicates down if I load the ORC files in the Spark
> context, but it is not the same when I try to read the Hive table directly.
> Please let me know if I am missing something here.
>  
> Is this supported in spark  ? 
>  
> when I load the files in spark Context 
> scala> val hlogsv5 = 
> sqlContext.read.format("orc").load("/user/hive/warehouse/hlogsv5")
> 17/10/24 16:11:15 INFO OrcRelation: Listing 
> maprfs:///user/hive/warehouse/hlogsv5 on driver
> 17/10/24 16:11:15 INFO OrcRelation: Listing 
> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003 on driver
> 17/10/24 16:11:15 INFO OrcRelation: Listing 
> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others on 
> driver
> 17/10/24 16:11:15 INFO OrcRelation: Listing 
> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others/usrpartkey=hhhUsers
>  on driver
> hlogsv5: org.apache.spark.sql.DataFrame = [id: string, chn: string, ht: 
> string, br: string, rg: string, cat: int, scat: int, usr: string, org: 
> string, act: int, ctm: int, c1: string, c2: string, c3: string, d1: int, d2: 
> int, doc: binary, cdt: int, catpartkey: string, usrpartkey: string]
> scala> hlogsv5.registerTempTable("tempo")
> scala> sqlContext.sql ( "selecT id from tempo where cdt=20171003 and 
> usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
> 17/10/24 16:11:22 INFO ParseDriver: Parsing command: selecT id from tempo 
> where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id 
> desc limit 10
> 17/10/24 16:11:22 INFO ParseDriver: Parse Completed
> 17/10/24 16:11:22 INFO DataSourceStrategy: Selected 1 partitions out of 1, 
> pruned 0.0% partitions.
> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6 stored as values in 
> memory (estimated size 164.5 KB, free 468.0 KB)
> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes 
> in memory (estimated size 18.3 KB, free 486.4 KB)
> 17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory 
> on 172.21.158.61:43493 (size: 18.3 KB, free: 511.4 MB)
> 17/10/24 16:11:22 INFO SparkContext: Created broadcast 6 from explain at 
> :33
> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7 stored as values in 
> memory (estimated size 170.2 KB, free 656.6 KB)
> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes 
> in memory (estimated size 18.8 KB, free 675.4 KB)
> 17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory 
> on 172.21.158.61:43493 (size: 18.8 KB, free: 511.4 MB)
> 17/10/24 16:11:22 INFO SparkContext: Created broadcast 7 from explain at 
> :33
> == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[id#145 DESC], output=[id#145])
> +- ConvertToSafe
> +- Project [id#145]
> +- Filter (usr#152 = AA0YP)
> +- Scan OrcRelation[id#145,usr#152] InputPaths: 
> maprfs:///user/hive/warehouse/hlogsv5, PushedFilters: [EqualTo(usr,AA0YP)]
>  
> When I read this as a Hive table:
>  
> scala> sqlContext.sql ( "selecT id from hlogsv5 where cdt=20171003 and 
> usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
> 17/10/24 16:11:32 INFO ParseDriver: Parsing command: selecT id from 
> hlogsv5 where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' 
> order by id desc limit 10
> 17/10/24 16:11:32 INFO ParseDriver: Parse Completed
> 17/10/24 16:11:32 INFO MemoryStore: Block broadcast_8 stored as values in 
> memory (estimated size 399.1 KB, free 1074.6 KB)
> 17/10/24 16:11:32 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes 
> in memory (estimated size 42.7 KB, free 1117.2 KB)
> 17/10/24 16:11:32 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory 
> on 172.21.158.61:43493 (size: 42.7 KB, free: 511.4 MB)
> 17/10/24 16:11:32 INFO SparkContext: Created broadcast 8 from explain at 
> :33
> == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[id#192 DESC], output=[id#192])
> +- ConvertToSafe
> +- Project [id#192]
> +- Filter (usr#199 = AA0YP)
> +- HiveTableScan [id#192,usr#199], MetastoreRelation default, hlogsv5, 
> None, [(cdt#189 = 20171003),(usrpartkey#191 = hhhUsers)]
>  
>  
> please let me know if i am missing anything here. thank you
> 
> 
> On Monday, October 23, 2017 1:56 PM, Siva Gudavalli  
> wrote:
> 
> 
> Hello,
>  
> I am working with Spark SQL to query Hive Managed Table (in Orc Format)
>  
> I have my data organized by partitions and asked to set indexes for each 
> 50,000 Rows by setting ('orc.row.index.stride'='5') 
>  
> lets say -> after evaluating partition there are around 50 files in which 
> data is organized.
>  
> Each file contains data specific to one given "cat" and I have

Re: spark session jdbc performance

2017-10-24 Thread lucas.g...@gmail.com
Sorry, I meant to say: "That code looks SANE to me"

Assuming that you're seeing the query running partitioned as expected, then
you're likely configured with only one executor. Very easy to check in the UI.

Gary Lucas
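
If it is one executor, a quick way to confirm from the shell (a sketch;
getExecutorMemoryStatus includes the driver's block manager, hence the "- 1",
and the YARN flags mentioned in the comment are the usual spark-submit options
with illustrative values):

// Sketch: count the registered executors from the Spark shell / driver.
val executorCount = spark.sparkContext.getExecutorMemoryStatus.size - 1
println(s"Registered executors: $executorCount")
// If this is 1, the 30 JDBC partitions will still be read with very little
// parallelism; on YARN you would typically request more with spark-submit
// options such as --num-executors 10 --executor-cores 3 (illustrative values).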

On 24 October 2017 at 16:09, lucas.g...@gmail.com 
wrote:

> Did you check the query plan / check the UI?
>
> That code looks same to me.  Maybe you've only configured for one executor?
>
> Gary
>
> On Oct 24, 2017 2:55 PM, "Naveen Madhire"  wrote:
>
>>
>> Hi,
>>
>>
>>
>> I am trying to fetch data from Oracle DB using a subquery and
>> experiencing a lot of performance issues.
>>
>> Below is the query I am using,
>>
>> Using Spark 2.0.2
>>
>> val df = spark_session.read.format("jdbc")
>>   .option("driver", "oracle.jdbc.OracleDriver")
>>   .option("url", jdbc_url)
>>   .option("user", user)
>>   .option("password", pwd)
>>   .option("dbtable", "subquery")
>>   .option("partitionColumn", "id")  // primary key column, uniformly distributed
>>   .option("lowerBound", "1")
>>   .option("upperBound", "50")
>>   .option("numPartitions", 30)
>>   .load()
>>
>> The above query is supposed to run with 30 partitions, but when I look at the UI
>> it is only using 1 partition to run the query.
>>
>> Can anyone tell me if I am missing anything, or if I need to do anything else to
>> tune the performance of the query?
>>
>> Thanks
>>
>


Re: Spark streaming for CEP

2017-10-24 Thread lucas.g...@gmail.com
This looks really interesting, thanks for linking!

Gary Lucas

On 24 October 2017 at 15:06, Mich Talebzadeh 
wrote:

> Great thanks Steve
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 October 2017 at 22:58, Stephen Boesch  wrote:
>
>> Hi Mich, the github link has a brief intro - including a link to the
>> formal docs http://logisland.readthedocs.io/en/latest/index.html .
>>  They have an architectural overview, developer guide, tutorial, and pretty
>> comprehensive api docs.
>>
>> 2017-10-24 13:31 GMT-07:00 Mich Talebzadeh :
>>
>>> thanks Thomas.
>>>
>>> do you have a summary write-up for this tool please?
>>>
>>>
>>> regards,
>>>
>>>
>>>
>>>
>>> Thomas
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 October 2017 at 13:53, Thomas Bailet 
>>> wrote:
>>>
 Hi

 we (@ hurence) have released an open source middleware based on
 Spark Streaming over Kafka to do CEP and log mining, called *logisland*
 (https://github.com/Hurence/logisland/). It has been deployed in
 production for 2 years now and does a great job. You should have a look.


 bye

 Thomas Bailet

 CTO : hurence

 On 18/10/17 at 22:05, Mich Talebzadeh wrote:

 As you may be aware the granularity that Spark streaming has is
 micro-batching and that is limited to 0.5 second. So if you have continuous
 ingestion of data then Spark streaming may not be granular enough for CEP.
 You may consider other products.

 Worth looking at this old thread on mine "Spark support for Complex
 Event Processing (CEP)

 https://mail-archives.apache.org/mod_mbox/spark-user/201604.
 mbox/%3CCAJ3fcbB8eaf0JV84bA7XGUK5GajC1yGT3ZgTNCi8arJg56=LbQ@
 mail.gmail.com%3E

 HTH


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 18 October 2017 at 20:52, anna stax  wrote:

> Hello all,
>
> Has anyone used spark streaming for CEP (Complex Event processing).
> Any CEP libraries that works well with spark. I have a use case for CEP 
> and
> trying to see if spark streaming is a good fit.
>
> Currently we have a data pipeline using Kafka, Spark streaming and
> Cassandra for data ingestion and near real time dashboard.
>
> Please share your experience.
> Thanks much.
> -Anna
>
>
>


>>>
>>
>


Null array of cols

2017-10-24 Thread Mohit Anchlia
I am trying to understand the best way to handle the scenario where an empty
array "[]" is passed. Can somebody suggest whether there is a way to filter out
such records? I've tried numerous things, including dataframe.head().isEmpty,
but pyspark doesn't recognize isEmpty even though I see it in the API docs.

pyspark.sql.utils.AnalysisException: u"cannot resolve '`timestamp`' given
input columns: []; line 1 pos 0;\n'Filter isnotnull('timestamp)\n+-
LogicalRDD\n"
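
A sketch of one way to guard against this, written in Scala since the other
examples in these threads are Scala (PySpark has the same checks via df.columns
and len(df.head(1)) == 0); the input path and column name are assumptions for
the example:

// Sketch: an empty "[]" payload parses into a DataFrame with zero columns and
// zero rows, so any reference to a "timestamp" column fails with
// "cannot resolve ... given input columns: []". Check the schema first.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("empty-input-guard").getOrCreate()
val df = spark.read.json("/tmp/events.json")   // hypothetical input

val usable =
  if (df.columns.contains("timestamp") && df.head(1).nonEmpty) {
    Some(df.filter(df("timestamp").isNotNull))
  } else {
    None   // skip the batch instead of hitting the AnalysisException
  }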


Re: spark session jdbc performance

2017-10-24 Thread lucas.g...@gmail.com
Did you check the query plan / check the UI?

That code looks same to me.  Maybe you've only configured for one executor?

Gary

On Oct 24, 2017 2:55 PM, "Naveen Madhire"  wrote:

>
> Hi,
>
>
>
> I am trying to fetch data from Oracle DB using a subquery and experiencing
> a lot of performance issues.
>
> Below is the query I am using,
>
> Using Spark 2.0.2
>
> val df = spark_session.read.format("jdbc")
>   .option("driver", "oracle.jdbc.OracleDriver")
>   .option("url", jdbc_url)
>   .option("user", user)
>   .option("password", pwd)
>   .option("dbtable", "subquery")
>   .option("partitionColumn", "id")  // primary key column, uniformly distributed
>   .option("lowerBound", "1")
>   .option("upperBound", "50")
>   .option("numPartitions", 30)
>   .load()
>
> The above query is supposed to run with 30 partitions, but when I look at the UI
> it is only using 1 partition to run the query.
>
> Can anyone tell me if I am missing anything, or if I need to do anything else to
> tune the performance of the query?
>
> Thanks
>


Re: Spark streaming for CEP

2017-10-24 Thread Mich Talebzadeh
Great thanks Steve

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 October 2017 at 22:58, Stephen Boesch  wrote:

> Hi Mich, the github link has a brief intro - including a link to the
> formal docs http://logisland.readthedocs.io/en/latest/index.html .   They
> have an architectural overview, developer guide, tutorial, and pretty
> comprehensive api docs.
>
> 2017-10-24 13:31 GMT-07:00 Mich Talebzadeh :
>
>> thanks Thomas.
>>
>> do you have a summary write-up for this tool please?
>>
>>
>> regards,
>>
>>
>>
>>
>> Thomas
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 October 2017 at 13:53, Thomas Bailet 
>> wrote:
>>
>>> Hi
>>>
>>> we (@ hurence) have released an open source middleware based on
>>> Spark Streaming over Kafka to do CEP and log mining, called *logisland*
>>> (https://github.com/Hurence/logisland/). It has been deployed in
>>> production for 2 years now and does a great job. You should have a look.
>>>
>>>
>>> bye
>>>
>>> Thomas Bailet
>>>
>>> CTO : hurence
>>>
>>> On 18/10/17 at 22:05, Mich Talebzadeh wrote:
>>>
>>> As you may be aware the granularity that Spark streaming has is
>>> micro-batching and that is limited to 0.5 second. So if you have continuous
>>> ingestion of data then Spark streaming may not be granular enough for CEP.
>>> You may consider other products.
>>>
>>> Worth looking at this old thread on mine "Spark support for Complex
>>> Event Processing (CEP)
>>>
>>> https://mail-archives.apache.org/mod_mbox/spark-user/201604.
>>> mbox/%3CCAJ3fcbB8eaf0JV84bA7XGUK5GajC1yGT3ZgTNCi8arJg56=LbQ@
>>> mail.gmail.com%3E
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 18 October 2017 at 20:52, anna stax  wrote:
>>>
 Hello all,

 Has anyone used spark streaming for CEP (Complex Event processing).
 Any CEP libraries that works well with spark. I have a use case for CEP and
 trying to see if spark streaming is a good fit.

 Currently we have a data pipeline using Kafka, Spark streaming and
 Cassandra for data ingestion and near real time dashboard.

 Please share your experience.
 Thanks much.
 -Anna



>>>
>>>
>>
>


Re: Spark streaming for CEP

2017-10-24 Thread Stephen Boesch
Hi Mich, the github link has a brief intro - including a link to the formal
docs http://logisland.readthedocs.io/en/latest/index.html .   They have an
architectural overview, developer guide, tutorial, and pretty comprehensive
api docs.

2017-10-24 13:31 GMT-07:00 Mich Talebzadeh :

> thanks Thomas.
>
> do you have a summary write-up for this tool please?
>
>
> regards,
>
>
>
>
> Thomas
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 October 2017 at 13:53, Thomas Bailet 
> wrote:
>
>> Hi
>>
>> we (@ hurence) have released an open source middleware based on
>> Spark Streaming over Kafka to do CEP and log mining, called *logisland* (
>> https://github.com/Hurence/logisland/). It has been deployed in
>> production for 2 years now and does a great job. You should have a look.
>>
>>
>> bye
>>
>> Thomas Bailet
>>
>> CTO : hurence
>>
>> On 18/10/17 at 22:05, Mich Talebzadeh wrote:
>>
>> As you may be aware the granularity that Spark streaming has is
>> micro-batching and that is limited to 0.5 second. So if you have continuous
>> ingestion of data then Spark streaming may not be granular enough for CEP.
>> You may consider other products.
>>
>> Worth looking at this old thread on mine "Spark support for Complex Event
>> Processing (CEP)
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201604.
>> mbox/%3CCAJ3fcbB8eaf0JV84bA7XGUK5GajC1yGT3ZgTNCi8arJg56=LbQ@
>> mail.gmail.com%3E
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 18 October 2017 at 20:52, anna stax  wrote:
>>
>>> Hello all,
>>>
>>> Has anyone used spark streaming for CEP (Complex Event processing).  Any
>>> CEP libraries that works well with spark. I have a use case for CEP and
>>> trying to see if spark streaming is a good fit.
>>>
>>> Currently we have a data pipeline using Kafka, Spark streaming and
>>> Cassandra for data ingestion and near real time dashboard.
>>>
>>> Please share your experience.
>>> Thanks much.
>>> -Anna
>>>
>>>
>>>
>>
>>
>


spark session jdbc performance

2017-10-24 Thread Naveen Madhire
Hi,



I am trying to fetch data from Oracle DB using a subquery and experiencing
a lot of performance issues.

Below is the query I am using,

Using Spark 2.0.2

val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("user", user)
  .option("password", pwd)
  .option("dbtable", "subquery")
  .option("partitionColumn", "id")  // primary key column, uniformly distributed
  .option("lowerBound", "1")
  .option("upperBound", "50")
  .option("numPartitions", 30)
  .load()

The above query is supposed to run with 30 partitions, but when I look at the UI
it is only using 1 partition to run the query.

Can anyone tell me if I am missing anything, or if I need to do anything else to
tune the performance of the query?

Thanks


spark session jdbc performance

2017-10-24 Thread Madhire, Naveen
Hi,

I am trying to fetch data from Oracle DB using a subquery and experiencing a lot 
of performance issues.

Below is the query I am using,

Using Spark 2.0.2

val df = spark_session.read.format("jdbc")
.option("driver","oracle.jdbc.OracleDriver")
.option("url", jdbc_url)
   .option("user", user)
   .option("password", pwd)
   .option("dbtable", "subquery")
   .option("partitionColumn", "id")  //primary key column uniformly distributed
   .option("lowerBound", "1")
   .option("upperBound", "50")
.option("numPartitions", 30)
.load()

The above query is supposed to run with 30 partitions, but when I look at the UI it is 
only using 1 partition to run the query.

Can anyone tell me if I am missing anything, or if I need to do anything else to tune 
the performance of the query?

Thanks




spark session jdbc performance

2017-10-24 Thread Madhire, Naveen
Hi,

I am trying to fetch data from Oracle DB using a subquery and experiencing a lot 
of performance issues.

Below is the query I am using,

Using Spark 2.0.2

val df = spark_session.read.format("jdbc")
.option("driver","oracle.jdbc.OracleDriver")
.option("url", jdbc_url)
   .option("user", user)
   .option("password", pwd)
   .option("dbtable", "subquery")
   .option("partitionColumn", "id")  //primary key column uniformly distributed
   .option("lowerBound", "1")
   .option("upperBound", "50")
.option("numPartitions", 30)
.load()

The above query is supposed to run with 30 partitions, but when I look at the UI it is 
only using 1 partition to run the query.

Can anyone tell me if I am missing anything, or if I need to do anything else to tune 
the performance of the query?

Thanks




Re: Spark streaming for CEP

2017-10-24 Thread Mich Talebzadeh
thanks Thomas.

do you have a summary write-up for this tool please?


regards,




Thomas

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 October 2017 at 13:53, Thomas Bailet 
wrote:

> Hi
>
> we (@ hurence) have released an open source middleware based on
> Spark Streaming over Kafka to do CEP and log mining, called *logisland*  (
> https://github.com/Hurence/logisland/). It has been deployed in
> production for 2 years now and does a great job. You should have a look.
>
>
> bye
>
> Thomas Bailet
>
> CTO : hurence
>
> On 18/10/17 at 22:05, Mich Talebzadeh wrote:
>
> As you may be aware the granularity that Spark streaming has is
> micro-batching and that is limited to 0.5 second. So if you have continuous
> ingestion of data then Spark streaming may not be granular enough for CEP.
> You may consider other products.
>
> Worth looking at this old thread on mine "Spark support for Complex Event
> Processing (CEP)
>
> https://mail-archives.apache.org/mod_mbox/spark-user/201604.mbox/%
> 3CCAJ3fcbB8eaf0JV84bA7XGUK5GajC1yGT3ZgTNCi8arJg56=l...@mail.gmail.com%3E
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 18 October 2017 at 20:52, anna stax  wrote:
>
>> Hello all,
>>
>> Has anyone used spark streaming for CEP (Complex Event processing).  Any
>> CEP libraries that works well with spark. I have a use case for CEP and
>> trying to see if spark streaming is a good fit.
>>
>> Currently we have a data pipeline using Kafka, Spark streaming and
>> Cassandra for data ingestion and near real time dashboard.
>>
>> Please share your experience.
>> Thanks much.
>> -Anna
>>
>>
>>
>
>


Re: Orc predicate pushdown with Spark Sql

2017-10-24 Thread Siva Gudavalli

Hello, I have an update here. Spark SQL is pushing predicates down if I load
the ORC files in the Spark context, but it is not the same when I try to read
the Hive table directly. Please let me know if I am missing something here. Is
this supported in Spark?

When I load the files in the Spark context:
scala> val hlogsv5 = 
sqlContext.read.format("orc").load("/user/hive/warehouse/hlogsv5")
17/10/24 16:11:15 INFO OrcRelation: Listing 
maprfs:///user/hive/warehouse/hlogsv5 on driver
17/10/24 16:11:15 INFO OrcRelation: Listing 
maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003 on driver
17/10/24 16:11:15 INFO OrcRelation: Listing 
maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others on 
driver
17/10/24 16:11:15 INFO OrcRelation: Listing 
maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others/usrpartkey=hhhUsers
 on driver
hlogsv5: org.apache.spark.sql.DataFrame = [id: string, chn: string, ht: 
string, br: string, rg: string, cat: int, scat: int, usr: string, org: string, 
act: int, ctm: int, c1: string, c2: string, c3: string, d1: int, d2: int, doc: 
binary, cdt: int, catpartkey: string, usrpartkey: string]
scala> hlogsv5.registerTempTable("tempo")
scala> sqlContext.sql ( "selecT id from tempo where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
17/10/24 16:11:22 INFO ParseDriver: Parsing command: selecT id from tempo where 
cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 
10
17/10/24 16:11:22 INFO ParseDriver: Parse Completed
17/10/24 16:11:22 INFO DataSourceStrategy: Selected 1 partitions out of 1, 
pruned 0.0% partitions.
17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6 stored as values in 
memory (estimated size 164.5 KB, free 468.0 KB)
17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in 
memory (estimated size 18.3 KB, free 486.4 KB)
17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 
172.21.158.61:43493 (size: 18.3 KB, free: 511.4 MB)
17/10/24 16:11:22 INFO SparkContext: Created broadcast 6 from explain at 
:33
17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7 stored as values in 
memory (estimated size 170.2 KB, free 656.6 KB)
17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in 
memory (estimated size 18.8 KB, free 675.4 KB)
17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on 
172.21.158.61:43493 (size: 18.8 KB, free: 511.4 MB)
17/10/24 16:11:22 INFO SparkContext: Created broadcast 7 from explain at 
:33
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[id#145 DESC], output=[id#145])
+- ConvertToSafe
+- Project [id#145]
+- Filter (usr#152 = AA0YP)
+- Scan OrcRelation[id#145,usr#152] InputPaths: 
maprfs:///user/hive/warehouse/hlogsv5, PushedFilters: [EqualTo(usr,AA0YP)]
When I read this as a Hive table:
 scala> sqlContext.sql ( "selecT id from hlogsv5 where cdt=20171003 and 
usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
17/10/24 16:11:32 INFO ParseDriver: Parsing command: selecT id from hlogsv5 
where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc 
limit 10
17/10/24 16:11:32 INFO ParseDriver: Parse Completed
17/10/24 16:11:32 INFO MemoryStore: Block broadcast_8 stored as values in 
memory (estimated size 399.1 KB, free 1074.6 KB)
17/10/24 16:11:32 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in 
memory (estimated size 42.7 KB, free 1117.2 KB)
17/10/24 16:11:32 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
172.21.158.61:43493 (size: 42.7 KB, free: 511.4 MB)
17/10/24 16:11:32 INFO SparkContext: Created broadcast 8 from explain at 
:33
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[id#192 DESC], output=[id#192])
+- ConvertToSafe
+- Project [id#192]
+- Filter (usr#199 = AA0YP)
+- HiveTableScan [id#192,usr#199], MetastoreRelation default, hlogsv5, 
None, [(cdt#189 = 20171003),(usrpartkey#191 = hhhUsers)]
Please let me know if I am missing anything here. Thank you.

On Monday, October 23, 2017 1:56 PM, Siva Gudavalli  
wrote:
 

Hello, I am working with Spark SQL to query a Hive managed table (in ORC
format). I have my data organized by partitions and was asked to set indexes
for every 50,000 rows by setting ('orc.row.index.stride'='5').

Let's say that, after evaluating a partition, there are around 50 files in
which the data is organized. Each file contains data specific to one given
"cat" and I have set up a bloom filter on cat.

My Spark SQL query looks like this -> select * from logs where cdt= 20171002
and catpartkey= others and usrpartkey= logUsers and cat = 24;

I have set the following property in my Spark SQL context, assuming this will
push down the filters:

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

My filters are never being pushed down, and it seems like partition pruning is
happening on all files. I dont

Re: Spark streaming for CEP

2017-10-24 Thread Thomas Bailet

Hi

we (@ hurence) have released an open source middleware based on
Spark Streaming over Kafka to do CEP and log mining, called *logisland*
(https://github.com/Hurence/logisland/). It has been deployed in
production for 2 years now and does a great job. You should have a look.



bye

Thomas Bailet

CTO : hurence


On 18/10/17 at 22:05, Mich Talebzadeh wrote:
As you may be aware the granularity that Spark streaming has is 
micro-batching and that is limited to 0.5 second. So if you have 
continuous ingestion of data then Spark streaming may not be granular 
enough for CEP. You may consider other products.


Worth looking at this old thread on mine "Spark support for Complex 
Event Processing (CEP)


https://mail-archives.apache.org/mod_mbox/spark-user/201604.mbox/%3CCAJ3fcbB8eaf0JV84bA7XGUK5GajC1yGT3ZgTNCi8arJg56=l...@mail.gmail.com%3E

HTH


Dr Mich Talebzadeh

LinkedIn 
/https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/


http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk.Any and all responsibility for 
any loss, damage or destruction of data or any other property which 
may arise from relying on this email's technical content is explicitly 
disclaimed. The author will in no case be liable for any monetary 
damages arising from such loss, damage or destruction.



On 18 October 2017 at 20:52, anna stax > wrote:


Hello all,

Has anyone used spark streaming for CEP (Complex Event
processing).  Any CEP libraries that works well with spark. I have
a use case for CEP and trying to see if spark streaming is a good
fit.

Currently we have a data pipeline using Kafka, Spark streaming and
Cassandra for data ingestion and near real time dashboard.

Please share your experience.
Thanks much.
-Anna







Databricks Certification Registration

2017-10-24 Thread sanat kumar Patnaik
Hello All,

Can anybody here please provide me a link to register for the Databricks Spark
developer certification (US based)? I have been googling but always end up
with this page in the end:

http://www.oreilly.com/data/sparkcert.html?cmp=ex-data-confreg-lp-na_databricks&__hssc=249029528.5.1508846982378&__hstc=249029528.4931db1bbf4dc75e6f600fda8e0dd9ed.1505493209384.1508604695490.1508846982378.5&__hsfp=1916184901&hsCtaTracking=1c78b453-7a67-41f9-9140-194bd9d477c2%7C01c9252f-c93b-407b-bdf2-ed82dd18e69b

There is no link which would take me to the next step. I have tried contacting
O'Reilly and they asked me to contact Databricks. I am trying to get through to
Databricks in parallel as well.

Any help here is much appreciated.

-- 
Regards,
Sanat Patnaik
Cell->804-882-6424


Re: Zero Coefficient in logistic regression

2017-10-24 Thread Alexis Peña
Thanks. 8 out of the 10 CRUZADAS coefficients have zero estimates; the
parameters for alpha and lambda are set to the default (I think zero). The
model in R and SAS was fitted using glm binary logistic regression.

 

Cheers

 

From: Simon Dirmeier 
Date: Tuesday, 24 October 2017, 08:30
To: Alexis Peña , 
Subject: Re: Zero Coefficient in logistic regression

 

So, all the coefficients are the same but  for CRUZADAS? How are you fitting 
the model in R (glm)?  Can you try setting zero penalty for alpha and lambda:
  .setRegParam(0)
  .setElasticNetParam(0)
Cheers,
S

On 24.10.17 at 13:19, Alexis Peña wrote:

Thanks for your answer. The “Cruzadas” features are binaries (0/1); the
chi-squared statistic should work with 2x2 tables.

 

I fit the model in SAS and R, and in both the coefficients have estimates (not
significant). Two of these features have estimates:

 

CRUZADAS    4907     0,247624087
CRUZADAS    5304    -0,161424508

 

 

Thanks

 

 

From: Weichen Xu 
Date: Tuesday, 24 October 2017, 07:23
To: Alexis Peña 
CC: "user @spark" 
Subject: Re: Zero Coefficient in logistic regression

 

Yes, the chi-squared statistic is only used on categorical features, so it
does not look appropriate here.

Thanks!

 

On Tue, Oct 24, 2017 at 5:13 PM, Simon Dirmeier  wrote:

Hey,

as far as I know, feature selection using a chi-squared statistic can only
be done on categorical features and not on possibly continuous ones?
Furthermore, since your logistic model doesn't use any regularization, you
should be fine here. So I'd check the ChiSqSelector and possibly replace it with
another feature selection method.

There is however always the chance that your response does not depend on your 
covariables, so you'd estimate a zero coefficient.

Cheers,
Simon


On 24.10.17 at 04:56, Alexis Peña wrote:

Hi Guys,

 

We are fitting a Logistic model using the following code.

 

 

val Chisqselector = new 
ChiSqSelector().setNumTopFeatures(10).setFeaturesCol("VECTOR_1").setLabelCol("TARGET").setOutputCol("selectedFeatures")

val assembler = new VectorAssembler().setInputCols(Array("FEATURES", 
"selectedFeatures", "PROM_MESES_DIST", "RECENCIA", "TEMP_MIN", "TEMP_MAX", 
"PRECIPITACIONES")).setOutputCol("Union")

val lr = new LogisticRegression().setLabelCol("TARGET").setFeaturesCol("Union")

val pipeline = new Pipeline().setStages(Array(Chisqselector, assembler, lr))

 

 

Do you know why the coefficients for the following features have zero estimates?
Is this produced by the ChiSqSelector or by the logistic model?

 

Thanks in advance!!

 

 

CODIGO       PARAMETRO          COEFICIENTES_MUESTREO_BALANCEADO
PROPIAS      CV_UM               0,276866756
PROPIAS      CV_U3M             -0,241851427
PROPIAS      CV_U6M             -0,568312819
PROPIAS      CV_U12M             0,134706601
PROPIAS      M_UM                5,47E-06
PROPIAS      M_U3M              -7,10E-06
PROPIAS      M_U6M               1,73E-05
PROPIAS      M_U12M             -5,41E-06
PROPIAS      CP_UM              -0,050750105
PROPIAS      CP_U3M              0,125483162
PROPIAS      CP_U6M             -0,353906788
PROPIAS      CP_U12M             0,159538155
PROPIAS      TUM                -0,020217902
PROPIAS      TU3M                0,002101906
PROPIAS      TU6M               -0,005481915
PROPIAS      TU12M               0,003443081
CRUZADAS     2303                0
CRUZADAS     3901                0
CRUZADAS     3905                0
CRUZADAS     3907                0
CRUZADAS     3909                0
CRUZADAS     4102                0
CRUZADAS     4307                0
CRUZADAS     4501                0
CRUZADAS     4907                0,247624087
CRUZADAS     5304               -0,161424508
LP           PROM_MESES_DIST    -0,680356554
PROPIAS      RECENCIA           -0,00289069
EXTERNAS     TEMP_MIN            0,006488683
EXTERNAS     TEMP_MAX           -0,013497441
EXTERNAS     PRECIPITACIONES    -0,007607086
INTERCEPTO                       2,401593191

 

 

 






Re: Zero Coefficient in logistic regression

2017-10-24 Thread Simon Dirmeier
So, all the coefficients are the same but  for CRUZADAS? How are you 
fitting the model in R (glm)?  Can you try setting zero penalty for 
alpha and lambda:


  .setRegParam(0)
  .setElasticNetParam(0)

Cheers,
S
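
As a concrete sketch of that suggestion (same stage as in the original mail,
with only the regularization parameters made explicit; column names are taken
from the thread, and 0.0 is also the documented default for both parameters):

// Sketch: logistic regression stage with an explicit zero penalty, so any zero
// coefficients cannot be explained by L1/elastic-net shrinkage.
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setLabelCol("TARGET")
  .setFeaturesCol("Union")
  .setRegParam(0.0)          // lambda = 0: no penalty
  .setElasticNetParam(0.0)   // alpha = 0 (irrelevant when lambda = 0)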


On 24.10.17 at 13:19, Alexis Peña wrote:


Thanks for your answer. The “Cruzadas” features are binaries (0/1);
the chi-squared statistic should work with 2x2 tables.


I fit the model in SAS and R, and in both the coefficients have estimates (not
significant). Two of these features have estimates:


CRUZADAS    4907     0,247624087
CRUZADAS    5304    -0,161424508

Thanks

*From:* Weichen Xu 
*Date:* Tuesday, 24 October 2017, 07:23
*To:* Alexis Peña 
*CC:* "user @spark" 
*Subject:* Re: Zero Coefficient in logistic regression

Yes, the chi-squared statistic is only used on categorical features, so it
does not look appropriate here.


Thanks!

On Tue, Oct 24, 2017 at 5:13 PM, Simon Dirmeier > wrote:


Hey,

as far as I know, feature selection using a chi-squared
statistic can only be done on categorical features and not on
possibly continuous ones?
Furthermore, since your logistic model doesn't use any
regularization, you should be fine here. So I'd check the
ChiSqSelector and possibly replace it with another feature
selection method.

There is however always the chance that your response does not
depend on your covariables, so you'd estimate a zero coefficient.

Cheers,
Simon

On 24.10.17 at 04:56, Alexis Peña wrote:

Hi Guys,

We are fitting a Logistic model using the following code.

val Chisqselector = new

ChiSqSelector().setNumTopFeatures(10).setFeaturesCol("VECTOR_1").setLabelCol("TARGET").setOutputCol("selectedFeatures")

val assembler = new
VectorAssembler().setInputCols(Array("FEATURES",
"selectedFeatures", "PROM_MESES_DIST", "RECENCIA", "TEMP_MIN",
"TEMP_MAX", "PRECIPITACIONES")).setOutputCol("Union")

val lr = new
LogisticRegression().setLabelCol("TARGET").setFeaturesCol("Union")

val pipeline = new Pipeline().setStages(Array(Chisqselector,
assembler, lr))

Do you know why the coefficients for the following features have zero
estimates? Is this produced by the ChiSqSelector or by the logistic model?

Thanks in advance!!

CODIGO       PARAMETRO          COEFICIENTES_MUESTREO_BALANCEADO
PROPIAS      CV_UM               0,276866756
PROPIAS      CV_U3M             -0,241851427
PROPIAS      CV_U6M             -0,568312819
PROPIAS      CV_U12M             0,134706601
PROPIAS      M_UM                5,47E-06
PROPIAS      M_U3M              -7,10E-06
PROPIAS      M_U6M               1,73E-05
PROPIAS      M_U12M             -5,41E-06
PROPIAS      CP_UM              -0,050750105
PROPIAS      CP_U3M              0,125483162
PROPIAS      CP_U6M             -0,353906788
PROPIAS      CP_U12M             0,159538155
PROPIAS      TUM                -0,020217902
PROPIAS      TU3M                0,002101906
PROPIAS      TU6M               -0,005481915
PROPIAS      TU12M               0,003443081
CRUZADAS     2303                0
CRUZADAS     3901                0
CRUZADAS     3905                0
CRUZADAS     3907                0
CRUZADAS     3909                0
CRUZADAS     4102                0
CRUZADAS     4307                0
CRUZADAS     4501                0
CRUZADAS     4907                0,247624087
CRUZADAS     5304               -0,161424508
LP           PROM_MESES_DIST    -0,680356554
PROPIAS      RECENCIA           -0,00289069
EXTERNAS     TEMP_MIN            0,006488683

Re: Zero Coefficient in logistic regression

2017-10-24 Thread Alexis Peña
Thanks for your answer. The “Cruzadas” features are binaries (0/1); the
chi-squared statistic should work with 2x2 tables.

 

I fit the model in SAS and R, and in both the coefficients have estimates (not
significant). Two of these features have estimates:

 

CRUZADAS    4907     0,247624087
CRUZADAS    5304    -0,161424508

 

 

Thanks

 

 

From: Weichen Xu 
Date: Tuesday, 24 October 2017, 07:23
To: Alexis Peña 
CC: "user @spark" 
Subject: Re: Zero Coefficient in logistic regression

 

Yes, the chi-squared statistic is only used on categorical features, so it
does not look appropriate here.

Thanks!

 

On Tue, Oct 24, 2017 at 5:13 PM, Simon Dirmeier  wrote:

Hey,

as far as I know, feature selection using a chi-squared statistic can only
be done on categorical features and not on possibly continuous ones?
Furthermore, since your logistic model doesn't use any regularization, you
should be fine here. So I'd check the ChiSqSelector and possibly replace it with
another feature selection method.

There is however always the chance that your response does not depend on your 
covariables, so you'd estimate a zero coefficient.

Cheers,
Simon

On 24.10.17 at 04:56, Alexis Peña wrote:

Hi Guys,

 

We are fitting a Logistic model using the following code.

 

 

val Chisqselector = new 
ChiSqSelector().setNumTopFeatures(10).setFeaturesCol("VECTOR_1").setLabelCol("TARGET").setOutputCol("selectedFeatures")

val assembler = new VectorAssembler().setInputCols(Array("FEATURES", 
"selectedFeatures", "PROM_MESES_DIST", "RECENCIA", "TEMP_MIN", "TEMP_MAX", 
"PRECIPITACIONES")).setOutputCol("Union")

val lr = new LogisticRegression().setLabelCol("TARGET").setFeaturesCol("Union")

val pipeline = new Pipeline().setStages(Array(Chisqselector, assembler, lr))

 

 

Do you know why the coefficients for the following features have zero estimates?
Is this produced by the ChiSqSelector or by the logistic model?

 

Thanks in advance!!

 

 

CODIGO       PARAMETRO          COEFICIENTES_MUESTREO_BALANCEADO
PROPIAS      CV_UM               0,276866756
PROPIAS      CV_U3M             -0,241851427
PROPIAS      CV_U6M             -0,568312819
PROPIAS      CV_U12M             0,134706601
PROPIAS      M_UM                5,47E-06
PROPIAS      M_U3M              -7,10E-06
PROPIAS      M_U6M               1,73E-05
PROPIAS      M_U12M             -5,41E-06
PROPIAS      CP_UM              -0,050750105
PROPIAS      CP_U3M              0,125483162
PROPIAS      CP_U6M             -0,353906788
PROPIAS      CP_U12M             0,159538155
PROPIAS      TUM                -0,020217902
PROPIAS      TU3M                0,002101906
PROPIAS      TU6M               -0,005481915
PROPIAS      TU12M               0,003443081
CRUZADAS     2303                0
CRUZADAS     3901                0
CRUZADAS     3905                0
CRUZADAS     3907                0
CRUZADAS     3909                0
CRUZADAS     4102                0
CRUZADAS     4307                0
CRUZADAS     4501                0
CRUZADAS     4907                0,247624087
CRUZADAS     5304               -0,161424508
LP           PROM_MESES_DIST    -0,680356554
PROPIAS      RECENCIA           -0,00289069
EXTERNAS     TEMP_MIN            0,006488683
EXTERNAS     TEMP_MAX           -0,013497441
EXTERNAS     PRECIPITACIONES    -0,007607086
INTERCEPTO                       2,401593191

 

 

 



Re: Zero Coefficient in logistic regression

2017-10-24 Thread Weichen Xu
Yes, the chi-squared statistic is only used on categorical features, so it does
not look appropriate here.
Thanks!

On Tue, Oct 24, 2017 at 5:13 PM, Simon Dirmeier 
wrote:

> Hey,
> as far as I know, feature selection using a chi-squared statistic can
> only be done on categorical features and not on possibly continuous ones?
> Furthermore, since your logistic model doesn't use any regularization, you
> should be fine here. So I'd check the ChiSqSelector and possibly replace it
> with another feature selection method.
>
> There is however always the chance that your response does not depend on
> your covariables, so you'd estimate a zero coefficient.
>
> Cheers,
> Simon
>
>
> On 24.10.17 at 04:56, Alexis Peña wrote:
>
> Hi Guys,
>
>
>
> We are fitting a Logistic model using the following code.
>
>
>
>
>
> val Chisqselector = new ChiSqSelector().setNumTopFeatures(10).
> setFeaturesCol("VECTOR_1").setLabelCol("TARGET").setOutputCol("
> selectedFeatures")
>
> val assembler = new VectorAssembler().setInputCols(Array("FEATURES",
> "selectedFeatures", "PROM_MESES_DIST", "RECENCIA", "TEMP_MIN", "TEMP_MAX",
> "PRECIPITACIONES")).setOutputCol("Union")
>
> val lr = new LogisticRegression().setLabelCol("TARGET").
> setFeaturesCol("Union")
>
> val pipeline = new Pipeline().setStages(Array(Chisqselector, assembler,
> lr))
>
>
>
>
>
> Do you know why the coefficients for the following features have zero estimates?
> Is this produced by the ChiSqSelector or by the logistic model?
>
>
>
> Thanks in advance!!
>
>
>
>
>
> CODIGO       PARAMETRO          COEFICIENTES_MUESTREO_BALANCEADO
> PROPIAS      CV_UM               0,276866756
> PROPIAS      CV_U3M             -0,241851427
> PROPIAS      CV_U6M             -0,568312819
> PROPIAS      CV_U12M             0,134706601
> PROPIAS      M_UM                5,47E-06
> PROPIAS      M_U3M              -7,10E-06
> PROPIAS      M_U6M               1,73E-05
> PROPIAS      M_U12M             -5,41E-06
> PROPIAS      CP_UM              -0,050750105
> PROPIAS      CP_U3M              0,125483162
> PROPIAS      CP_U6M             -0,353906788
> PROPIAS      CP_U12M             0,159538155
> PROPIAS      TUM                -0,020217902
> PROPIAS      TU3M                0,002101906
> PROPIAS      TU6M               -0,005481915
> PROPIAS      TU12M               0,003443081
> CRUZADAS     2303                0
> CRUZADAS     3901                0
> CRUZADAS     3905                0
> CRUZADAS     3907                0
> CRUZADAS     3909                0
> CRUZADAS     4102                0
> CRUZADAS     4307                0
> CRUZADAS     4501                0
> CRUZADAS     4907                0,247624087
> CRUZADAS     5304               -0,161424508
> LP           PROM_MESES_DIST    -0,680356554
> PROPIAS      RECENCIA           -0,00289069
> EXTERNAS     TEMP_MIN            0,006488683
> EXTERNAS     TEMP_MAX           -0,013497441
> EXTERNAS     PRECIPITACIONES    -0,007607086
> INTERCEPTO                       2,401593191
>
>
>
>
>


Re: Zero Coefficient in logistic regression

2017-10-24 Thread Simon Dirmeier

Hey,

as far as I know, feature selection using a chi-squared statistic 
can only be done on categorical features and not on possibly continuous 
ones?
Furthermore, since your logistic model doesn't use any regularization, 
you should be fine here. So I'd check the ChiSqSelector and possibly 
replace it with another feature selection method.


There is however always the chance that your response does not depend on 
your covariables, so you'd estimate a zero coefficient.


Cheers,
Simon


On 24.10.17 at 04:56, Alexis Peña wrote:


Hi Guys,

We are fitting a Logistic model using the following code.

val Chisqselector = new 
ChiSqSelector().setNumTopFeatures(10).setFeaturesCol("VECTOR_1").setLabelCol("TARGET").setOutputCol("selectedFeatures")


val assembler = new VectorAssembler().setInputCols(Array("FEATURES", 
"selectedFeatures", "PROM_MESES_DIST", "RECENCIA", "TEMP_MIN", 
"TEMP_MAX", "PRECIPITACIONES")).setOutputCol("Union")


val lr = new 
LogisticRegression().setLabelCol("TARGET").setFeaturesCol("Union")


val pipeline = new Pipeline().setStages(Array(Chisqselector, 
assembler, lr))


Do you know why the coefficients for the following features have zero
estimates? Is this produced by the ChiSqSelector or by the logistic model?


Thanks in advance!!

CODIGO       PARAMETRO          COEFICIENTES_MUESTREO_BALANCEADO
PROPIAS      CV_UM               0,276866756
PROPIAS      CV_U3M             -0,241851427
PROPIAS      CV_U6M             -0,568312819
PROPIAS      CV_U12M             0,134706601
PROPIAS      M_UM                5,47E-06
PROPIAS      M_U3M              -7,10E-06
PROPIAS      M_U6M               1,73E-05
PROPIAS      M_U12M             -5,41E-06
PROPIAS      CP_UM              -0,050750105
PROPIAS      CP_U3M              0,125483162
PROPIAS      CP_U6M             -0,353906788
PROPIAS      CP_U12M             0,159538155
PROPIAS      TUM                -0,020217902
PROPIAS      TU3M                0,002101906
PROPIAS      TU6M               -0,005481915
PROPIAS      TU12M               0,003443081
CRUZADAS     2303                0
CRUZADAS     3901                0
CRUZADAS     3905                0
CRUZADAS     3907                0
CRUZADAS     3909                0
CRUZADAS     4102                0
CRUZADAS     4307                0
CRUZADAS     4501                0
CRUZADAS     4907                0,247624087
CRUZADAS     5304               -0,161424508
LP           PROM_MESES_DIST    -0,680356554
PROPIAS      RECENCIA           -0,00289069
EXTERNAS     TEMP_MIN            0,006488683
EXTERNAS     TEMP_MAX           -0,013497441
EXTERNAS     PRECIPITACIONES    -0,007607086
INTERCEPTO                       2,401593191





Fwd: Spark 1.x - End of life

2017-10-24 Thread Ismaël Mejía
Thanks for your answer Matei. I agree that a more explicit maintenance
policy is needed (even for the 2.x releases). I did not immediately
find anything about this in the website, so I ended up assuming the
information of the wikipedia article that says that the 1.6.x line is
still maintained.

I see that Spark as an open source project can get updates if the
community brings them in, but it is probably also a good idea to be
clear about the expectations for the end users. I suppose some users
who can migrate to version 2 won’t do it if there is still support
(notice that ‘support’ can be tricky considering how different
companies re-package/maintain Spark but this is a different
discussion). Anyway it would be great to have this defined somewhere.
Maybe worth a discussion on dev@.

On Thu, Oct 19, 2017 at 11:20 PM, Matei Zaharia  wrote:
> Hi Ismael,
>
> It depends on what you mean by “support”. In general, there won’t be new 
> feature releases for 1.X (e.g. Spark 1.7) because all the new features are 
> being added to the master branch. However, there is always room for bug fix 
> releases if there is a catastrophic bug, and committers can make those at any 
> time. In general though, I’d recommend moving workloads to Spark 2.x. We 
> tried to make the migration as easy as possible (a few APIs changed, but not 
> many), and 2.x has been out for a long time now and is widely used.
>
> We should perhaps write a more explicit maintenance policy, but all of this 
> is run based on what committers want to work on; if someone thinks that 
> there’s a serious enough issue in 1.6 to update it, they can put together a 
> new release. It does help to hear from users about this though, e.g. if you 
> think there’s a significant issue that people are missing.
>
> Matei
>
>> On Oct 19, 2017, at 5:20 AM, Ismaël Mejía  wrote:
>>
>> Hello,
>>
>> I noticed that some of the (Big Data / Cloud Managed) Hadoop
>> distributions are starting to (phase out / deprecate) Spark 1.x and I
>> was wondering if the Spark community has already decided when will it
>> end the support for Spark 1.x. I ask this also considering that the
>> latest release in the series is already almost one year old. Any idea
>> on this ?
>>
>> Thanks,
>> Ismaël
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Accessing UI for spark running as kubernetes container on standby name node

2017-10-24 Thread Mohit Gupta
Hi,

We are launching all Spark jobs as Kubernetes (k8s) containers inside a
k8s cluster. We also create a service for each job and do port forwarding
for the Spark UI (the container's 4040 is mapped to SvcPort 31123).
The same set of nodes is also hosting a YARN cluster.
Inside the container, we do spark-submit to YARN in client mode.

Now, the Spark container gets launched on either of the name nodes - master or
standby. There is a VIP assigned to the active name node.

When the Spark container gets launched on the active name node, the Spark UI and
all its tabs are easily accessible from anywhere using VIP:SvcPort.
However, when the Spark container gets launched on the standby name node, the
Spark UI is NOT accessible and eventually the request fails with a 500 error
(the VIP:SvcPort gets redirected to ActiveNamenode:8080, which is not accessible
because the driver is now running on the standby name node).

I have tried to reason about and try out multiple service configurations, but
nothing seems to work for this scenario.

Can anyone please suggest a proper k8s service config for this scenario? We do
not want to restrict the Spark containers to the active name node; we would
rather let k8s balance the job load within the k8s cluster.

Any help is much appreciated.