Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-02-08 Thread Bryan Cutler
Nicolas, are you referring to printing the model params in that example
with "print(model1.extractParamMap())"?  There was a problem with pyspark
models not having params after being fit, which caused this example to show
nothing for the model param maps.  This was fixed in
https://issues.apache.org/jira/browse/SPARK-10931 and the example now shows
all model params.  The fix will be in the Spark 2.3 release.
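
For reference, here is a rough, untested Scala sketch of the CrossValidator setup discussed further down this thread, gridding over pre-configured classifiers (the approach that produced non-empty coefficients below) and printing the fitted model's params at the end. The `training` DataFrame and the evaluator choice are placeholders:

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val logist = new LogisticRegression()
val ova = new OneVsRest().setClassifier(logist)

// Grid over fully configured classifiers; don't forget build().
val paramMaps = new ParamGridBuilder()
  .addGrid(ova.classifier, Array(
    new LogisticRegression().setRegParam(0.1),
    new LogisticRegression().setRegParam(0.3)))
  .build()

val cv = new CrossValidator()
  .setEstimator(ova)
  .setEstimatorParamMaps(paramMaps)
  .setEvaluator(new MulticlassClassificationEvaluator())  // placeholder evaluator
  .setNumFolds(3)

// `training` is assumed to be a DataFrame with "features" and "label" columns.
val cvModel = cv.fit(training)
println(cvModel.bestModel.extractParamMap())

(In Scala the fitted model already exposes its params; the fix above addresses the pyspark side.)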

Bryan

On Wed, Jan 31, 2018 at 10:20 PM, Nicolas Paris  wrote:

> Hey
>
> I am also interested in how to get those parameters.
> For example, the demo code spark-2.2.1-bin-hadoop2.7/
> examples/src/main/python/ml/estimator_transformer_param_example.py
> returns empty parameters when printing "lr.extractParamMap()"
>
> That's weird
>
> Thanks
>
> On Jan 30, 2018 at 23:10, Bryan Cutler wrote:
> > Hi Michelle,
> >
> > Your original usage of ParamGridBuilder was not quite right, `addGrid`
> expects
> > (some parameter, array of values for that parameter).  If you want to do
> a grid
> > search with different regularization values, you would do the following:
> >
> > val paramMaps = new ParamGridBuilder().addGrid(logist.regParam,
> Array(0.1,
> > 0.3)).build()
> >
> > * don't forget to build the grid after adding values
> >
> > On Tue, Jan 30, 2018 at 6:55 AM, michelleyang <
> michelle1026sh...@gmail.com>
> > wrote:
> >
> > I tried to use One vs Rest in spark ml with a pipeline and
> > CrossValidator for multinomial logistic regression.
> >
> > It came out with empty coefficients. I figured out it was the setting of
> > ParamGridBuilder. Can anyone help me understand how the parameter
> > setting affects the CrossValidator process?
> >
> > The original code: // outputs empty coefficients
> >
> > val logist=new LogisticRegression
> >
> > val ova = new OneVsRest().setClassifier(logist)
> >
> > val paramMaps = new ParamGridBuilder().addGrid(ova.classifier,
> > Array(logist.getRegParam))
> >
> > New code: // outputs multi-class coefficients
> >
> > val logist=new LogisticRegression
> >
> > val ova = new OneVsRest().setClassifier(logist)
> >
> > val classifier1 = new LogisticRegression().setRegParam(2.0)
> >
> > val classifier2 = new LogisticRegression().setRegParam(3.0)
> >
> > val paramMaps = new ParamGridBuilder() .addGrid(ova.classifier,
> > Array(classifier1, classifier2))
> >
> > Please help Thanks.
> >
> >
> >


Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-08 Thread M Singh
Hi Spark Experts:
I am trying to use a stateful udf with spark structured streaming that needs to 
update the state periodically.
Here is the scenario:
1. I have a udf with a variable that has a default value (e.g. 1). This value is 
applied to a column (e.g. subtract the variable from the column value).
2. The variable is to be updated periodically and asynchronously (e.g. by reading a file 
every 5 minutes), and new rows should have the new value applied to the column value.
Spark natively supports broadcast variables, but I could not find a way to 
update a broadcast variable dynamically, or to rebroadcast it, so that 
the udf's internal state can be updated while the structured streaming application 
is running.
I can try to read the variable from the file on each invocation of the udf, but 
that will not scale, since each invocation would open/read/close the file.
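
For illustration, here is a rough, untested sketch of the kind of thing I mean: a per-executor cache inside the udf that refreshes the value at most once per interval instead of on every row (the file path and loadValue() are placeholders for the real source):

import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.sql.functions.udf

// Per-executor cache: the value is re-read at most once per refresh interval
// instead of on every row. loadValue() and the path are placeholders.
object RefreshableState {
  private val refreshIntervalMs = 5 * 60 * 1000L
  private val cached = new AtomicReference[(Long, Double)]((0L, 1.0))  // (loadedAtMs, value), default 1

  def current(): Double = {
    val (loadedAt, value) = cached.get()
    val now = System.currentTimeMillis()
    if (now - loadedAt > refreshIntervalMs) {
      val fresh = loadValue()
      cached.set((now, fresh))
      fresh
    } else value
  }

  private def loadValue(): Double = {
    val src = scala.io.Source.fromFile("/path/to/state.txt")
    try src.getLines().next().trim.toDouble finally src.close()
  }
}

// Subtract the (periodically refreshed) variable from the column value.
val subtractState = udf((x: Double) => x - RefreshableState.current())
// usage: df.withColumn("adjusted", subtractState(col("value")))

With this, each executor refreshes on its own schedule, so an update propagates within roughly one interval, but the file is opened at most once per interval per executor rather than once per row.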
Please let me know if there is any documentation/example to support this 
scenario.
Thanks





Free access to Index Conf for Apache Spark community attendees

2018-02-08 Thread xwu0226
Free access to Index Conf for Apache Spark session attendees. For info go to:
https://www.meetup.com/SF-Big-Analytic

IBM is hosting a developer conference. Essentially, the conference is ‘By
Developers, for Developers’, based on open technologies.
It will be held Feb 20 - 22 at Moscone West.
http://indexconf.com

We have a phenomenal list of speakers for the Spark community as well as
participation by the Tensorflow, R, and other communities.
https://developer.ibm.com/indexconf/communities/
Register using this promo code to get your free access.
Usage Instructions:
Follow this link:
https://www.ibm.com/events/wwe/indexconf/indexconf18.nsf/Registration.xsp?open
Select “Attendee” as your Registration Type
Enter your Registration Promotion Code*: IND18FULL

*Restrictions:
Promotion Code expires February 12th, 11:59PM Pacific
Government Owned Entities (GOE’s) not eligible

*If you have previously registered, please reach out to Jeff Borek
(jbo...@us.ibm.com) to take advantage of the new discount code.*

Detailed Agenda for Spark Community Day Feb 20th

2:00 - 2:30
What the community is doing for the coming Spark 2.3 release
Sean Li, Apache Spark committer & PMC member from Databricks

Many great features have been added to Apache Spark. This talk provides a
preview of the new features and updates in the coming Spark 2.3 release.

2:30 - 3:00
Data Warehouse Features in Spark SQL
Ioana Ursu, IBM Lead contributor on SparkSQL

This talk covers advanced Spark SQL features for data warehousing, such as
star-schema optimizations and support for informational constraints. A star
schema consists of a fact table referencing a number of dimension tables,
where the fact and dimension tables are in a primary-key/foreign-key
relationship. An informational or statistical constraint can be used by Spark
to improve query performance.

3:00 - 3:30
Building an Enterprise/Cloud Analytics Platform with Jupyter Notebooks and
Apache Spark
Frederick Reiss, Chief Architect of the IBM Spark Technology Center

Data scientists are becoming a necessity for every company in today's
data-centric world, and with them comes the requirement to provide a flexible
and interactive analytics platform that exposes notebook services at web
scale. In this session we will describe our experience and best practices
building such a platform, in particular how we built the Enterprise Gateway
that enables all the notebooks to share the Spark cluster's computational
resources.

3:45 - 4:15
The State of Spark MLlib and New Scalability Features in 2.3
Nick Pentreath, Spark committer & PMC member

This talk will give an overview of Spark’s machine learning library, MLlib.
The new 2.3 release of Spark brings some exciting scalability enhancements
to MLlib, which we will explore in depth, including parallel
cross-validation and performance improvements for larger-scale datasets
through adding multi-column support to the most widely-used Spark
transformers.

4:15 - 5:15
Spark and AI
Nick Pentreath & Fred Reiss

This session will be an open discussion of the role of Spark within the AI
landscape, and what the future holds for AI / deep learning on Spark. In
recent years specialized systems (such as TensorFlow, Caffe, PyTorch and
MXNet) have been dominant in the domain of AI and deep learning. While there
are a few deep learning frameworks that are Spark-specific, these frameworks
are often separate from Spark, and the ease of integration and the feature
sets they expose vary considerably.

5:15 - 6:00
HSpark: enabling Spark SQL queries on NoSQL HBase tables
Bo Meng, IBM Spark contributor, and Yan Zhou, IBM Hadoop Architect

HBase is a NoSQL data source which allows flexible data storage and access
mechanisms. Leveraging Spark's highly scalable framework and programming
interface, we added SQL capability to HBase and an easy-to-use interface for
data scientists and traditional analysts. We will discuss how we implemented
HSpark by leveraging the Spark SQL parser, mapping different data types,
pushing predicates down to HBase, and improving query performance.



-
Xin Wu | @xwu0226
Spark Technology Center



Unsubscribe

2018-02-08 Thread Yosef Moatti




Re: Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-08 Thread Pralabh Kumar
Hi Jacek

https://cwiki.apache.org/confluence/display/Hive/StorageHandlers

The motivation is to make it possible to allow Hive to access data stored
and managed by other systems in a modular, extensible fashion.


I have a Hive script with a custom storage handler, something like this:

CREATE EXTERNAL TABLE $temp_output
(
  data String
)
STORED BY 'ABCStorageHandler'
LOCATION '$table_location'
TBLPROPERTIES (
);


When I migrate to Spark, it says the STORED BY operation is not permitted.
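
A rough sketch of what a Spark-side replacement might look like, going through the DataFrame reader/writer API instead of a Hive storage handler. The format name "com.example.abc" is purely hypothetical and stands in for whatever connector would replace ABCStorageHandler; spark is an existing SparkSession:

// Spark SQL has no STORED BY; a data source implementation is used instead.
val tableLocation = "/path/to/table"   // placeholder for $table_location

val df = spark.read
  .format("com.example.abc")           // hypothetical data source
  .option("path", tableLocation)
  .load()

df.write
  .format("com.example.abc")
  .option("path", tableLocation)
  .mode("append")
  .save()

// Register as a view if SQL access is needed:
df.createOrReplaceTempView("temp_output")
spark.sql("SELECT data FROM temp_output").show()

Whether this is feasible of course depends on a Spark connector existing for the system behind the storage handler.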

Regards
Pralabh Kumar

On Thu, Feb 8, 2018 at 6:28 PM, Jacek Laskowski  wrote:

> Hi,
>
> Since I'm new to Hive, what does `stored by` do? I might help a bit in
> Spark if I only knew a bit about Hive :)
>
> Regards,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Thu, Feb 8, 2018 at 7:25 AM, Pralabh Kumar 
> wrote:
>
>> Hi
>>
>> Spark 2.0 doesn't support STORED BY. Is there any alternative to achieve
>> the same?
>>
>>
>>
>


Re: Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-08 Thread Jacek Laskowski
Hi,

Since I'm new to Hive, what does `stored by` do? I might help a bit in
Spark if I only knew a bit about Hive :)

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Thu, Feb 8, 2018 at 7:25 AM, Pralabh Kumar 
wrote:

> Hi
>
> Spark 2.0 doesn't support STORED BY. Is there any alternative to achieve
> the same?
>
>
>


Spark conf forgets cassandra host in the configuration file

2018-02-08 Thread Ismail Bayraktar
Hello,
I am facing an issue with Spark Conf while reading the Cassandra host property 
from the default spark configuration file.

I use Kafka 2.11.0.10, Spark 2.2.1, and Cassandra 3.11. I have a Docker 
container where the Spark master, a worker, and my app run in standalone cluster 
mode. My Spark app uses a structured streaming pipeline that reads from Kafka 
and stores to Cassandra, using a foreach sink to write the streaming data. In 
the foreach writer, my application forgets the Cassandra host specified in the 
Spark defaults. I print the Spark config when the application starts and can 
see that it reads spark.cassandra.connection.host and prints it properly.

However, when I start publishing messages to Kafka and the foreach writer is 
triggered, I see that it tries to connect to localhost by default and does not 
pick up the host from the Spark defaults.

Any idea what I am dealing with here? I am out of options. 

The source code is available here: 
Application: 
https://github.com/ibyrktr/metric/blob/master/metricconsumer/src/main/java/eu/dipherential/metricconsumer/stream/MetricEventConsumer.java

Foreach writer (as a workaround, I set the host manually again):

https://github.com/ibyrktr/metric/blob/master/metricconsumer/src/main/java/eu/dipherential/metricconsumer/store/HourlyMetricStatsByIdSink.java
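
A rough sketch of that workaround: read the host once on the driver (where the Spark defaults are resolved correctly) and pass it into the writer explicitly, so the executor side never falls back to localhost. The actual Cassandra client calls are left as placeholder comments:

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

class CassandraSink(cassandraHost: String) extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    // open a connection to cassandraHost here (Cassandra client code omitted)
    true
  }
  override def process(row: Row): Unit = {
    // write the row using the connection opened above
  }
  override def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}

val spark = SparkSession.builder().getOrCreate()
val host = spark.conf.get("spark.cassandra.connection.host", "localhost")

// events is the streaming DataFrame read from Kafka:
// events.writeStream.foreach(new CassandraSink(host)).start()

I would still like to understand why the foreach writer does not see the value from the Spark defaults on its own.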

Thanks.
Regards,
Ismail Bayraktar 
Sent by mobile.