Re: ML:One vs Rest with crossValidator for multinomial in logistic regression
Nicolas, are you referring to printing the model params in that example with "print(model1.extractParamMap())"? There was a problem with PySpark models not having params after being fit, which caused this example to show nothing for model param maps. This was fixed in https://issues.apache.org/jira/browse/SPARK-10931 and the example now shows all model params. The fix will be in the Spark 2.3 release.

Bryan

On Wed, Jan 31, 2018 at 10:20 PM, Nicolas Paris wrote:
> Hey
>
> I am also interested in how to get those parameters.
> For example, the demo code
> spark-2.2.1-bin-hadoop2.7/examples/src/main/python/ml/estimator_transformer_param_example.py
> returns empty parameters when printing "lr.extractParamMap()"
>
> That's weird
>
> Thanks
>
> Le 30 janv. 2018 à 23:10, Bryan Cutler écrivait :
> > Hi Michelle,
> >
> > Your original usage of ParamGridBuilder was not quite right: `addGrid` expects
> > (some parameter, array of values for that parameter). If you want to do a grid
> > search with different regularization values, you would do the following:
> >
> > val paramMaps = new ParamGridBuilder()
> >   .addGrid(logist.regParam, Array(0.1, 0.3))
> >   .build()
> >
> > * don't forget to build the grid after adding values
> >
> > On Tue, Jan 30, 2018 at 6:55 AM, michelleyang <michelle1026sh...@gmail.com> wrote:
> >
> > I tried to use One vs Rest in Spark ML with Pipeline and CrossValidator for
> > multinomial logistic regression.
> >
> > It came out with empty coefficients. I figured out it was the setting of
> > ParamGridBuilder. Can anyone help me understand how the parameter
> > setting affects the CrossValidator process?
> >
> > The original code: // outputs empty coefficients
> >
> > val logist = new LogisticRegression
> > val ova = new OneVsRest().setClassifier(logist)
> > val paramMaps = new ParamGridBuilder().addGrid(ova.classifier,
> >   Array(logist.getRegParam))
> >
> > New code: // outputs multi-class coefficients
> >
> > val logist = new LogisticRegression
> > val ova = new OneVsRest().setClassifier(logist)
> > val classifier1 = new LogisticRegression().setRegParam(2.0)
> > val classifier2 = new LogisticRegression().setRegParam(3.0)
> > val paramMaps = new ParamGridBuilder()
> >   .addGrid(ova.classifier, Array(classifier1, classifier2))
> >
> > Please help. Thanks.
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
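The fix above hinges on what a param grid actually is: each `addGrid(param, values)` call contributes one axis, and `build()` returns the cross product of all axes as a list of param maps, one per candidate model. The semantics can be sketched in plain Python — this is an illustration of the concept, not Spark's implementation:

```python
from itertools import product

class ParamGridBuilder:
    """Toy model of Spark ML's ParamGridBuilder: build() returns the
    cross product of all (param, values) axes as a list of dicts."""

    def __init__(self):
        self._grid = {}  # param name -> list of candidate values

    def add_grid(self, param, values):
        self._grid[param] = list(values)
        return self  # chainable, like the Scala API

    def build(self):
        names = list(self._grid)
        return [dict(zip(names, combo))
                for combo in product(*(self._grid[n] for n in names))]

# Two regParam values x two elasticNetParam values -> 4 param maps,
# so CrossValidator fits and evaluates 4 candidate models per fold:
maps = (ParamGridBuilder()
        .add_grid("regParam", [0.1, 0.3])
        .add_grid("elasticNetParam", [0.0, 0.5])
        .build())
```

This also shows why the original code produced empty coefficients: passing `Array(logist.getRegParam)` as the values for `ova.classifier` puts a bare number where the grid expects classifier instances, instead of one configured classifier per grid point.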
Apache Spark - Structured Streaming - Updating UDF state dynamically at run time
Hi Spark Experts,

I am trying to use a stateful UDF with Spark Structured Streaming that needs to update its state periodically. Here is the scenario:

1. I have a UDF with a variable with a default value (e.g. 1). This value is applied to a column (e.g. the variable is subtracted from the column value).
2. The variable is to be updated periodically and asynchronously (e.g. by reading a file every 5 minutes), and new rows will have the new value applied to the column value.

Spark natively supports broadcast variables, but I could not find a way to update a broadcast variable dynamically, or to rebroadcast it, so that the UDF's internal state can be updated while the structured streaming application is running. I could read the variable from the file on each invocation of the UDF, but that will not scale, since each invocation would open, read, and close the file.

Please let me know if there is any documentation/example to support this scenario.

Thanks
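One common workaround (not from this thread, just a sketch) is to cache the file's value in the UDF's closure and refresh it only when a time-to-live expires, so most invocations pay no I/O while each executor still picks up changes within the refresh interval. A minimal plain-Python sketch, where the file path and the 5-minute TTL are assumptions matching the scenario above:

```python
import time

class RefreshingValue:
    """Caches a value loaded from a file, reloading at most once per TTL."""

    def __init__(self, path, ttl_seconds=300):
        self.path = path         # hypothetical path to the value file
        self.ttl = ttl_seconds
        self._value = 1          # default value, as in the scenario above
        self._loaded_at = 0.0    # forces a load on first access

    def get(self):
        now = time.time()
        if now - self._loaded_at >= self.ttl:
            try:
                with open(self.path) as f:
                    self._value = int(f.read().strip())
            except OSError:
                pass             # keep the previous value if the read fails
            self._loaded_at = now
        return self._value

# The UDF would close over one instance per executor process, e.g.:
#   state = RefreshingValue("/path/to/value.txt")
#   subtract_udf = udf(lambda x: x - state.get(), IntegerType())
```

Note this gives eventual consistency only — each executor refreshes on its own schedule, so rows processed within the same TTL window on different executors may see different values.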
Free access to Index Conf for Apache Spark community attendees
Free access to Index Conf for Apache Spark session attendees. For info go to: https://www.meetup.com/SF-Big-Analytic

IBM is hosting a developer conference - essentially a conference "By Developers, for Developers", based on open technologies. It will be held Feb 20-22 in Moscone West. http://indexconf.com

We have a phenomenal list of speakers for the Spark community, as well as participation by the TensorFlow, R, and other communities. https://developer.ibm.com/indexconf/communities/

Register using this promo code to get your free access. Usage instructions:

Follow this link: https://www.ibm.com/events/wwe/indexconf/indexconf18.nsf/Registration.xsp?open
Select "Attendee" as your Registration Type
Enter your Registration Promotion Code*: IND18FULL

*Restrictions: Promotion code expires February 12th, 11:59 PM Pacific. Government Owned Entities (GOEs) are not eligible.

*If you have previously registered, please reach out to Jeff Borek (jbo...@us.ibm.com) to take advantage of the new discount code.*

Detailed agenda for Spark Community Day, Feb 20th:

2:00 - 2:30  What the community does in the coming release, Spark 2.3
Sean Li, Apache Spark committer & PMC member from Databricks
There are many great features added to Apache Spark. This talk provides a preview of the new features and updates in the coming Spark 2.3 release.

2:30 - 3:00  Data Warehouse Features in Spark SQL
Ioana Ursu, IBM, lead contributor on Spark SQL
This talk covers advanced Spark SQL features for data warehousing, such as star-schema optimizations and informational constraints support. A star schema consists of a fact table referencing a number of dimension tables; fact and dimension tables are in a primary key - foreign key relationship. An informational or statistical constraint can be used by Spark to improve query performance.

3:00 - 3:30  Building an Enterprise/Cloud Analytics Platform with Jupyter Notebooks and Apache Spark
Frederick Reiss, chief architect, IBM Spark Technology Center
Data scientists are becoming a necessity for every company in today's data-centric world, and with them comes the requirement to provide a flexible, interactive analytics platform that exposes notebook services at web scale. In this session we will describe our experience and best practices building the platform, in particular how we built the Enterprise Gateway that enables all the notebooks to share the Spark cluster's computational resources.

3:45 - 4:15  The State of Spark MLlib and New Scalability Features in 2.3
Nick Pentreath, Spark committer & PMC member
This talk will give an overview of Spark's machine learning library, MLlib. The new 2.3 release of Spark brings some exciting scalability enhancements to MLlib, which we will explore in depth, including parallel cross-validation and performance improvements for larger-scale datasets through adding multi-column support to the most widely used Spark transformers.

4:15 - 5:15  Spark and AI
Nick Pentreath & Fred Reiss
This session will be an open discussion of the role of Spark within the AI landscape, and what the future holds for AI / deep learning on Spark. In recent years specialized systems (such as TensorFlow, Caffe, PyTorch and MXNet) have been dominant in the domain of AI and deep learning. While there are a few deep learning frameworks that are Spark-specific, these frameworks are often separate from Spark, and the ease of integration and feature set exposed varies considerably.

5:15 - 6:00  HSpark: enable Spark SQL query on NoSQL HBase tables
Bo Meng, IBM Spark contributor; Yan Zhou, IBM Hadoop architect
HBase is a NoSQL data source which allows flexible data storage and access mechanisms. While leveraging Spark's highly scalable framework and programming interface, we added SQL capability to HBase and an easy-to-use interface for data scientists and traditional analysts. We will discuss how we implemented HSpark by leveraging the Spark SQL parser, mapping different data types, pushing down predicates to HBase, and improving query performance.

- Xin Wu | @xwu0226
Spark Technology Center

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Unsubscribe
Re: Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it
Hi Jacek

https://cwiki.apache.org/confluence/display/Hive/StorageHandlers

The motivation is to make it possible for Hive to access data stored and managed by other systems in a modular, extensible fashion.

I have a Hive script which uses a custom storage handler, something like this:

CREATE EXTERNAL TABLE $temp_output
(
  data String
)
STORED BY 'ABCStorageHandler'
LOCATION '$table_location'
TBLPROPERTIES (
);

When I migrate to Spark, it says the STORED BY operation is not permitted.

Regards
Pralabh Kumar

On Thu, Feb 8, 2018 at 6:28 PM, Jacek Laskowski wrote:
> Hi,
>
> Since I'm new to Hive, what does `stored by` do? I might help a bit in
> Spark if I only knew a bit about Hive :)
>
> Pozdrawiam,
> Jacek Laskowski
>
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Thu, Feb 8, 2018 at 7:25 AM, Pralabh Kumar wrote:
>
>> Hi
>>
>> Spark 2.0 doesn't support STORED BY. Is there any alternative to achieve
>> the same?
Re: Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it
Hi,

Since I'm new to Hive, what does `stored by` do? I might help a bit in Spark if I only knew a bit about Hive :)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Thu, Feb 8, 2018 at 7:25 AM, Pralabh Kumar wrote:
> Hi
>
> Spark 2.0 doesn't support stored by. Is there any alternative to achieve
> the same?
Spark conf forgets cassandra host in the configuration file
Hello,

I am facing an issue with SparkConf while reading the Cassandra host property from the default Spark configuration file. I use Kafka 0.10 (Scala 2.11), Spark 2.2.1, and Cassandra 3.11. I have a Docker container where the Spark master, a worker, and my app run in standalone cluster mode.

My Spark app uses a structured streaming pipeline that reads from Kafka and stores to Cassandra. I use a foreach sink to write the streaming data to Cassandra. In the ForeachWriter, my application forgets the Cassandra host specified in the Spark defaults. I print out the Spark config when the application starts, and I can see that it reads spark.cassandra.connection.host and prints it out properly. However, when I start publishing messages to Kafka and the ForeachWriter is triggered, I see that it tries to connect to localhost by default and forgets / does not read the host in the Spark defaults.

Any idea what I am dealing with here? I am out of options. The source code is available here:

Application:
https://github.com/ibyrktr/metric/blob/master/metricconsumer/src/main/java/eu/dipherential/metricconsumer/stream/MetricEventConsumer.java

ForeachWriter (as a workaround I set the host manually again):
https://github.com/ibyrktr/metric/blob/master/metricconsumer/src/main/java/eu/dipherential/metricconsumer/store/HourlyMetricStatsByIdSink.java

Thanks.

Regards,
Ismail Bayraktar
Sent by mobile.
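The symptom described above is consistent with the foreach writer being serialized and shipped to executors: any setting the writer resolves lazily on the executor (instead of capturing on the driver) falls back to its default, such as localhost. One way to make the host survive the trip, sketched here as a plain-Python analogy using pickle in place of Java serialization — the class and host names are hypothetical stand-ins, not the original project's code:

```python
import pickle

class HourlyStatsWriter:
    """Toy stand-in for a foreach writer: fields captured in the
    constructor are serialized with the instance and travel to the
    executor; anything resolved lazily in open() uses the default."""

    def __init__(self, cassandra_host=None):
        # Capture the host on the driver, at construction time.
        self.cassandra_host = cassandra_host or "localhost"

    def open(self):
        # On the executor, only the serialized fields are available.
        return "connecting to " + self.cassandra_host

# Driver side: read the host from the Spark config and bake it in.
writer = HourlyStatsWriter(cassandra_host="cassandra-node1")  # hypothetical host

# Simulate shipping the writer to an executor and reviving it there:
revived = pickle.loads(pickle.dumps(writer))
```

Under this pattern the writer keeps the configured host after deserialization, which matches the workaround mentioned above of setting the host manually again inside the writer.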