[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi, all The ANALYZE TABLE command is run from Spark on a Hive table. Question: before I run the 'ANALYZE TABLE' command on the Spark-sql client, I ran the 'ANALYZE TABLE' command on the Hive client, and the wrong statistic info shows up. For example: 1. run the analyze table command on the hive client - create table test_a

apache-spark

2021-10-14 Thread Nick Shivhare
Hi All, We are facing an issue and would be thankful if anyone can help us with it. Environment: Spark, Kubernetes and Airflow. Airflow is used to schedule Spark jobs over Kubernetes. We are using a bash script which uses the spark-submit command to submit Spark jobs. Issue: We are submitt

Re: Spark AQE Post-Shuffle partitions coalesce don't work as expected, and even make data skew in some partitions. Need help to debug issue.

2021-07-05 Thread Nick Grigoriev
oalesce functions. But I am not sure, so I split all my debugging into steps and put it in this mail. Another issue is that I don't have a public dataset where I can reproduce this issue right now. And I can't publish my current dataset outside the company because of privacy. > On 4 Jul 2021, at 16:44, Mi

Spark AQE Post-Shuffle partitions coalesce don't work as expected, and even make data skew in some partitions. Need help to debug issue.

2021-07-04 Thread Nick Grigoriev
I have asked this question on Stack Overflow, but it looks too complex for a Q/A resource. https://stackoverflow.com/questions/68236323/spark-aqe-post-shuffle-partitions-coalesce-dont-work-as-expected-and-even-make So I want to ask for help here. I use a global sort on my Spark DF, and when I enable AQE and

Loading Hadoop-Azure in Kubernetes

2021-04-16 Thread Nick Stenroos-Dam
Hello, I am trying to load the Hadoop-Azure driver in Apache Spark, but so far I have failed. The plan is to include the required files in the Docker image, as we plan on using a client-mode SparkSession. My current Dockerfile looks like this: FROM spark:latest
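A minimal sketch of one alternative approach (not from the thread): pulling hadoop-azure at session startup via spark.jars.packages instead of baking the jars into the image. The Maven coordinates are real; the versions below are placeholders and must match your Hadoop build.

    import org.apache.spark.sql.SparkSession

    // Fetch the Azure connector and its storage dependency at startup.
    // Versions are illustrative - align them with the Hadoop version in the image.
    val spark = SparkSession.builder()
      .appName("azure-example")
      .config("spark.jars.packages",
        "org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.6")
      .getOrCreate()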

PySpark .collect() output to Scala Array[Row]

2020-05-25 Thread Nick Ruest
Hi, I've hit a wall trying to implement a couple of Scala methods in a Python version of our project. I've implemented a number of these already, but I'm getting hung up on this one. My Python function looks like this: def Write_Graphml(data, graphml_path, sc): return sc.getOrCreate()

Re: Extract value from streaming Dataframe to a variable

2020-01-21 Thread Nick Dawes
> each output of micro-batch: > > http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch > > Hope this helps. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > > On Mon, Jan 20, 2020 at 8:43 PM Nick Dawes wrote: &

Re: Extract value from streaming Dataframe to a variable

2020-01-20 Thread Nick Dawes
Streaming experts, any clues on how to achieve this? After extracting a few variables, I need to run them through a REST API for verification and decision making. Thanks for your help. Nick On Fri, Jan 17, 2020, 6:27 PM Nick Dawes wrote: > I need to extract a value from a PySpark structu

Extract value from streaming Dataframe to a variable

2020-01-17 Thread Nick Dawes
streaming Dataframe, collect is not supported. Any workaround for this? Nick

Re: Structured Streaming Dataframe Size

2019-08-28 Thread Nick Dawes
t incrementally to update the result, and then discards the >> source data. It only keeps around the minimal intermediate *state* data >> as required to update the result (e.g. intermediate counts in the earlier >> example). >> > > > On Tue, Aug 27, 2019 at 1:21 PM Nick

Structured Streaming Dataframe Size

2019-08-27 Thread Nick Dawes
a on S3 or on local file system of the cluster? Nick

Spark Structured Streaming XML content

2019-08-14 Thread Nick Dawes
any xml functions to convert text data using a schema? I saw an example for JSON data using from_json. Is it possible to use spark.read on a dataframe column? I need to find an aggregated "Amount1" for every 5-minute window. Thanks for your help. Nick
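There is no built-in from_xml in this Spark version (the external spark-xml package provides one), but a hedged sketch of the JSON analogue the poster mentions, plus the 5-minute window aggregation, might look like this (schema and column names are hypothetical):

    import org.apache.spark.sql.functions.{col, from_json, sum, window}
    import org.apache.spark.sql.types._

    // Hypothetical schema for the text payload column
    val schema = new StructType()
      .add("Amount1", DoubleType)
      .add("eventTime", TimestampType)

    val parsed = df
      .withColumn("data", from_json(col("value"), schema))
      .select("data.*")

    // Aggregate Amount1 over 5-minute event-time windows
    val agg = parsed
      .groupBy(window(col("eventTime"), "5 minutes"))
      .agg(sum("Amount1").as("totalAmount1"))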

Re: Spark Image resizing

2019-07-31 Thread Nick Dawes
Any other way of resizing the image before creating the DataFrame in Spark? I know opencv does it. But I don't have opencv on my cluster. I have Anaconda python packages installed on my cluster. Any ideas will be appreciated. Thank you! On Tue, Jul 30, 2019, 4:17 PM Nick Dawes wrote:

Spark Image resizing

2019-07-30 Thread Nick Dawes
Hi, I'm new to Spark's image data source. After creating a dataframe using Spark's image data source, I would like to resize the images in PySpark. df = spark.read.format("image").load(imageDir) Can you please help me with this? Nick

Re: [EXTERNAL] - Re: Problem with the ML ALS algorithm

2019-06-26 Thread Nick Pentreath
t I create using some > likelihood distributions of the rating values. I am only experimenting / > learning. In practice though, the list of items is likely to be at least > in the 10’s if not 100’s. Are even this item numbers to low? > > > > Thanks. > > > > -S >

Re: [EXTERNAL] - Re: Problem with the ML ALS algorithm

2019-06-26 Thread Nick Pentreath
; Number of items is 4 > > Ratings values are either 120, 20, 0 > > > > > > *From:* Nick Pentreath > *Sent:* Wednesday, June 26, 2019 6:03 AM > *To:* user@spark.apache.org > *Subject:* [EXTERNAL] - Re: Problem with the ML ALS algorithm > > > > This means that

Re: Problem with the ML ALS algorithm

2019-06-26 Thread Nick Pentreath
This means that the matrix that ALS is trying to factor is not positive definite. Try increasing regParam (try 0.1, 1.0 for example). What does the data look like? e.g. number of users, number of items, number of ratings, etc? On Wed, Jun 26, 2019 at 12:06 AM Steve Pruitt wrote: > I get an inex
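A minimal sketch of the suggestion, assuming the ml ALS API (column names hypothetical):

    import org.apache.spark.ml.recommendation.ALS

    // A larger regParam regularizes the factorization and can avoid the
    // non-positive-definite failure described above.
    val als = new ALS()
      .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
      .setRegParam(0.1)  // try 0.1, then 1.0
    val model = als.fit(ratings)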

Spark on Kubernetes Authentication error

2019-06-06 Thread Nick Dawes
be687016bc Tried creating a different clusterrole with admin privileges, but it did not work. Any idea how to fix this one? Thanks. - Nick

Re: Getting List of Executor Id's

2019-05-13 Thread Afshartous, Nick
Answering my own question. Looks like this can be done by implementing SparkListener with the method def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit, as the SparkListenerExecutorAdded object has the info. -- Nick Am using
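A hedged sketch of that listener approach (class name and output are illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

    class ExecutorIdListener extends SparkListener {
      override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit = {
        // executorId and executorInfo.executorHost carry the id/host info
        println(s"executor ${added.executorId} on ${added.executorInfo.executorHost}")
      }
    }

    // register it: sc.addSparkListener(new ExecutorIdListener)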

Getting List of Executor Id's

2019-05-13 Thread Afshartous, Nick
Hi, I am using Spark 2.3 and looking for an API in Java to fetch the list of executors. I need host and Id info for the executors. Thanks for any pointers, -- Nick

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Nick Pentreath
Multi-column support for StringIndexer didn't make it into Spark 2.3.0. The PR is still in progress I think - it should be available in 2.4.0. On Mon, 14 May 2018 at 22:32, Mina Aslani wrote: > Please take a look at the api doc: > https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feat

Re: A naive ML question

2018-04-29 Thread Nick Pentreath
One potential approach could be to construct a transition matrix showing the probability of moving from each state to another state. This can be visualized with a “heat map” encoding (I think matshow in numpy/matplotlib does this). On Sat, 28 Apr 2018 at 21:34, kant kodali wrote: > Hi, > > I mea

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Nick Pentreath
Also check out FeatureHasher in Spark 2.3.0 which is designed to handle this use case in a more natural way than HashingTF (and handles multiple columns at once). On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin wrote: > Hi Shahab, > > do you actually need to have a few columns with such a huge am
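A minimal FeatureHasher sketch (column names hypothetical), hashing several high-cardinality columns in one pass:

    import org.apache.spark.ml.feature.FeatureHasher

    val hasher = new FeatureHasher()
      .setInputCols("userId", "productId", "category")  // multiple columns at once
      .setOutputCol("features")
      .setNumFeatures(1 << 18)  // fixed dimensionality; trades collisions for memory
    val hashed = hasher.transform(df)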

Re: Spark MLlib: Should I call .cache before fitting a model?

2018-02-27 Thread Nick Pentreath
Currently, fit for many (most I think) models will cache the input data. For LogisticRegression this is definitely the case, so you won't get any benefit from caching it yourself. On Tue, 27 Feb 2018 at 21:25 Gevorg Hari wrote: > Imagine that I am training a Spark MLlib model as follows: > > val

Re: Reverse MinMaxScaler in SparkML

2018-01-29 Thread Nick Pentreath
This would be interesting and a good addition I think. It bears some thought about the API though. One approach is to have an "inverseTransform" method similar to sklearn. The other approach is to "formalize" something like StringIndexerModel -> IndexToString. Here, the inverse transformer is a s
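The existing StringIndexerModel -> IndexToString pairing mentioned above looks roughly like this (column names hypothetical):

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    val indexerModel = indexer.fit(df)

    // The "inverse transformer": maps indices back to the original string labels
    val converter = new IndexToString()
      .setInputCol("categoryIndex").setOutputCol("originalCategory")
      .setLabels(indexerModel.labels)
    val roundTripped = converter.transform(indexerModel.transform(df))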

Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Nick Pentreath
At least one of their comparisons is flawed. The Spark ML version of linear regression (*note* they use linear regression and not logistic regression, it is not clear why) uses L-BFGS as the solver, not SGD (as MLLIB uses). Hence it is typically going to be slower. However, it should in most cases

Re: [ML] Allow CrossValidation ParamGrid on SVMWithSGD

2018-01-19 Thread Nick Pentreath
SVMWithSGD sits in the older "mllib" package and is not compatible directly with the DataFrame API. I suppose one could write a ML-API wrapper around it. However, there is LinearSVC in Spark 2.2.x: http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine
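A minimal LinearSVC sketch for Spark 2.2.x (dataset and parameter values hypothetical):

    import org.apache.spark.ml.classification.LinearSVC

    val svc = new LinearSVC()
      .setMaxIter(100)
      .setRegParam(0.1)
    val model = svc.fit(trainingDF)  // trainingDF needs "features" and "label" columns
    val predictions = model.transform(testDF)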

Re: Access to Applications metrics

2017-12-04 Thread Nick Dimiduk
Bump. On Wed, Nov 15, 2017 at 2:28 PM, Nick Dimiduk wrote: > Hello, > > I'm wondering if it's possible to get access to the detailed > job/stage/task level metrics via the metrics system (JMX, Graphite, &c). > I've enabled the wildcard sink and I do not see

Re: does "Deep Learning Pipelines" scale out linearly?

2017-11-22 Thread Nick Pentreath
For that package specifically it’s best to see if they have a mailing list and if not perhaps ask on github issues. Having said that perhaps the folks involved in that package will reply here too. On Wed, 22 Nov 2017 at 20:03, Andy Davidson wrote: > I am starting a new deep learning project cur

Access to Applications metrics

2017-11-15 Thread Nick Dimiduk
tances, is this the case? Has anyone worked on a SparkListener that would bridge data from one to the other? Thanks, Nick

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Nick Pentreath
For now, you must follow this approach of constructing a pipeline consisting of a StringIndexer for each categorical column. See https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to allow multiple columns for StringIndexer, which is being worked on currently. The reason you're
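A sketch of that pipeline-per-column approach (column names hypothetical):

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.StringIndexer

    val categoricalCols = Seq("colA", "colB", "colC")
    // one StringIndexer stage per categorical column
    val indexers = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
    }.toArray[PipelineStage]
    val indexed = new Pipeline().setStages(indexers).fit(df).transform(df)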

Re: How to run MLlib's word2vec in CBOW mode?

2017-09-28 Thread Nick Pentreath
MLlib currently doesn't support CBOW - there is an open PR for it (see https://issues.apache.org/jira/browse/SPARK-20372). On Thu, 28 Sep 2017 at 09:56 pun wrote: > Hello, > My understanding is that word2vec can be ran in two modes: > >- continuous bag-of-words (CBOW) (order of words does no

Re: isCached

2017-09-01 Thread Nick Pentreath
No, unfortunately not - as I recall, storageLevel accesses some private methods to get the result. On Fri, 1 Sep 2017 at 17:55, Nathan Kronenfeld wrote: > Ah, in 2.1.0. > > I'm in 2.0.1 at the moment... is there any way that works that far back? > > On Fri, Sep 1, 2017 at 11:4

Re: isCached

2017-09-01 Thread Nick Pentreath
Dataset does have storageLevel. So you can use isCached = (storageLevel != StorageLevel.NONE) as a test. Arguably isCached could be added to dataset too, shouldn't be a controversial change. On Fri, 1 Sep 2017 at 17:31, Nathan Kronenfeld wrote: > I'm currently porting some of our code from RDDs
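The test from the mail, as a one-liner (ds is any Dataset or DataFrame; storageLevel exists on Dataset since 2.1.0):

    import org.apache.spark.storage.StorageLevel

    val isCached = ds.storageLevel != StorageLevel.NONE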

Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for each release. I think generally we catch all breaking and most major behavior changes On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun wrote: > +1 > > On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li wrote: > >> Hi, Devs, >> >> Many

Re: Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Nick Pentreath
u, Jul 20, 2017 at 12:50 PM, Nick Pentreath > wrote: > >> Currently it's not supported, but is on the roadmap: see >> https://issues.apache.org/jira/browse/SPARK-13025 >> >> The most recent attempt is to start with simple linear regression, as >> here: https:

Re: Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Nick Pentreath
Currently it's not supported, but is on the roadmap: see https://issues.apache.org/jira/browse/SPARK-13025 The most recent attempt is to start with simple linear regression, as here: https://issues.apache.org/jira/browse/SPARK-21386 On Thu, 20 Jul 2017 at 08:36 Aseem Bansal wrote: > We were abl

Re: Regarding Logistic Regression changes in Spark 2.2.0

2017-07-19 Thread Nick Pentreath
L-BFGS is the default optimization method since the initial ML package implementation. The OWLQN variant is used only when L1 regularization is specified (via the elasticNetParam). 2.2 adds the box constraints (optimized using the LBFGS-B variant). So no, no upgrade is required to use L-BFGS - if
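A sketch showing how the regularization setting selects the variant described above (values illustrative):

    import org.apache.spark.ml.classification.LogisticRegression

    // elasticNetParam = 0.0 -> pure L2, solved with plain L-BFGS
    val l2 = new LogisticRegression().setRegParam(0.1).setElasticNetParam(0.0)
    // elasticNetParam = 1.0 -> pure L1, which triggers the OWLQN variant
    val l1 = new LogisticRegression().setRegParam(0.1).setElasticNetParam(1.0)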

Re: Spark 2.1.1: A bug in org.apache.spark.ml.linalg.* when using VectorAssembler.scala

2017-07-13 Thread Nick Pentreath
There are Vector classes under ml.linalg package - And VectorAssembler and other feature transformers all work with ml.linalg vectors. If you try to use mllib.linalg vectors instead you will get an error as the user defined type for SQL is not correct On Thu, 13 Jul 2017 at 11:23, wrote: > Dea

Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Nick Pentreath
You will need to use PySpark vectors to store in a DataFrame. They can be created from Numpy arrays as follows: from pyspark.ml.linalg import Vectors df = spark.createDataFrame([("src1", "pkey1", 1, Vectors.dense(np.array([0, 1, 2])))]) On Wed, 28 Jun 2017 at 12:23 Judit Planas wrote: > Dear a

Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nick Chammas
als, start_new_session) File "/usr/lib64/python3.5/subprocess.py", line 1544, in _execute_child raise child_exception_type(errno_num, err_msg) FileNotFoundError: [Errno 2] No such file or directory: './bin/spark-submit' Does anyone have clues about what might be going on? Nick

Re: Question about mllib.recommendation.ALS

2017-06-08 Thread Nick Pentreath
Spark 2.2 will support the recommend-all methods in ML. Also, both ML and MLLIB performance has been greatly improved for the recommend-all methods. Perhaps you could check out the current RC of Spark 2.2 or master branch to try it out? N On Thu, 8 Jun 2017 at 17:18, Sahib Aulakh [Search] ­ < s
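The Spark 2.2 recommend-all API referred to above, in a minimal sketch (column names hypothetical):

    import org.apache.spark.ml.recommendation.{ALS, ALSModel}

    val model: ALSModel = new ALS()
      .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
      .fit(ratings)
    // top-10 item recommendations for every user, and the item-side analogue
    val userRecs = model.recommendForAllUsers(10)
    val itemRecs = model.recommendForAllItems(10)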

Re: spark ML Recommender program

2017-05-17 Thread Nick Pentreath
It sounds like this may be the same as https://issues.apache.org/jira/browse/SPARK-20402 On Thu, 18 May 2017 at 08:16 Nick Pentreath wrote: > Could you try setting the checkpoint interval for ALS (try 3, 5 say) and > see what the effect is? > > > > On Thu, 18 May 2017 at 0

Re: spark ML Recommender program

2017-05-17 Thread Nick Pentreath
Could you try setting the checkpoint interval for ALS (try 3, 5 say) and see what the effect is? On Thu, 18 May 2017 at 07:32 Mark Vervuurt wrote: > If you are running locally try increasing driver memory to for example 4G > en executor memory to 3G. > Regards, Mark > > On 18 May 2017, at 05:1

Re: ElasticSearch Spark error

2017-05-15 Thread Nick Pentreath
It may be best to ask on the elasticsearch-Hadoop github project On Mon, 15 May 2017 at 13:19, nayan sharma wrote: > Hi All, > > *ERROR:-* > > *Caused by: org.apache.spark.util.TaskCompletionListenerException: > Connection error (check network and/or proxy settings)- all nodes failed; > tried [[

Reading ORC file - fine on 1.6; GC timeout on 2+

2017-05-05 Thread Nick Chammas
strange and smells like buggy behavior. How can I debug this or work around it in Spark 2+? Nick

Re: pyspark vector

2017-04-25 Thread Nick Pentreath
Well the 3 in this case is the size of the sparse vector. This equates to the number of features, which for CountVectorizer (I assume that's what you're using) is also vocab size (number of unique terms). On Tue, 25 Apr 2017 at 04:06 Peyman Mohajerian wrote: > setVocabSize > > > On Mon, Apr 24,
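A small sketch of that relationship (assuming CountVectorizer; column names hypothetical):

    import org.apache.spark.ml.feature.CountVectorizer

    val cv = new CountVectorizer()
      .setInputCol("tokens").setOutputCol("features")
      .setVocabSize(3)  // caps the vocabulary, and thus the sparse vector size
    val cvModel = cv.fit(df)
    // cvModel.vocabulary.length == size of every output vector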

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Nick Pentreath
Why not use the RandomForest from Spark ML? On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > I have already posted this question to the StackOverflow > . > However,

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
unds like something which > could be ran in parallel. > > > On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath > wrote: > > What is the size of training data (number examples, number features)? > Dense or sparse features? How many classes? > > What commands are you using to sub

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
What is the size of training data (number examples, number features)? Dense or sparse features? How many classes? What commands are you using to submit your job via spark-submit? On Fri, 7 Apr 2017 at 13:12 Aseem Bansal wrote: > When using spark ml's LogisticRegression, RandomForest, CrossValid

Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nick Chammas
onflict? Does one override the other? I posted a more detailed question about an issue I'm having with this on Stack Overflow: http://stackoverflow.com/q/43239921/877069 Nick

Re: Collaborative filtering steps in spark

2017-03-29 Thread Nick Pentreath
No, it does a random initialization. It does use a slightly different approach from pure normal random - it chooses non-negative draws which results in very slightly better results empirically. In practice I'm not sure if the average rating approach will make a big difference (it's been a long whi

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
I usually advocate a JIRA even for small stuff, but for a doc-only change like this it's ok to submit a PR directly with [MINOR] in the title. On Thu, 23 Mar 2017 at 06:55, chris snow wrote: > Thanks Nick. If this will help other users, I'll create a JIRA and > send a patch. > &g

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
Yup, that is true and a reasonable clarification of the doc. On Thu, 23 Mar 2017 at 00:03 chris snow wrote: > The documentation for collaborative filtering is as follows: > > === > Scaling of the regularization parameter > > Since v1.1, we scale the regularization parameter lambda in solving > e

[ Spark Streaming & Kafka 0.10 ] Possible bug

2017-03-22 Thread Afshartous, Nick
Hi, I think I'm seeing a bug in the context of upgrading to the Kafka 0.10 streaming API. Code fragments follow. -- Nick JavaInputDStream<ConsumerRecord<String, String>> rawStream = getDirectKafkaStream(); JavaDStream<Tuple2<String, String>> messagesTuple = rawStream.map( new Funct

Re: Contributing to Spark

2017-03-19 Thread Nick Pentreath
If you have experience and interest in Python then PySpark is a good area to look into. Yes, adding things like tests & documentation is a good starting point. Start out relatively small and go from there. Adding new wrappers to python for ML is useful for slightly larger tasks. On Mon, 20 Mar

Re: Check if dataframe is empty

2017-03-07 Thread Nick Pentreath
I believe take on an empty dataset will return an empty Array rather than throw an exception. df.take(1).isEmpty should work On Tue, 7 Mar 2017 at 07:42, Deepak Sharma wrote: > If the df is empty , the .take would return > java.util.NoSuchElementException. > This can be done as below: > df.rdd.
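The check from the mail, plus a later convenience (df.isEmpty only arrived in Spark 2.4):

    val empty = df.take(1).isEmpty  // returns an empty Array on an empty dataset, no exception
    // Spark 2.4+: val empty = df.isEmpty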

[Spark Kafka] API Doc pages for Kafka 0.10 not current

2017-02-27 Thread Afshartous, Nick
, following the links API Docs --> (Scala | Java) leads to API pages that do not have class ConsumerStrategies) . The API doc package names also have streaming.kafka as opposed to streaming.kafka10. -- Nick

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread Nick Pentreath
ty threshold, the number of hash tables, the > bucket width, etc... > > Thanks! > > On Mon, Feb 13, 2017 at 3:21 PM, Nick Pentreath > wrote: > > The original Uber authors provided this performance test result: > https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HV

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-13 Thread Nick Pentreath
missing values? > No strings in unusual encoding? No additional or missing columns ? > 2) How long does your job run? What about garbage collector parameters? > Have you checked what happens with jconsole / jvisualvm ? > > Sincerely yours, Timur > > On Sat, Feb 11, 2017 at 12:52

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Nick Pentreath
What other params are you using for the lsh transformer? Are the issues occurring during transform or during the similarity join? On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan wrote: > hi Das, > In general, I will apply them to larger datasets, so I want to use LSH, > which is more scaleable t
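For reference, the main knobs on the 2.1.0 LSH transformers look like this (values illustrative, not a tuning recommendation):

    import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

    val brp = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)   // bucket width
      .setNumHashTables(3)    // more tables -> better recall, more computation
      .setInputCol("features").setOutputCol("hashes")
    val model = brp.fit(dfA)
    // similarity join within a distance threshold
    val pairs = model.approxSimilarityJoin(dfA, dfB, 1.5, "distCol")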

Re: ML PIC

2017-01-16 Thread Nick Pentreath
The JIRA for this is here: https://issues.apache.org/jira/browse/SPARK-15784 There is a PR open already for it, which still needs to be reviewed. On Wed, 21 Dec 2016 at 18:01 Robert Hamilton wrote: > Thank you Nick that is good to know. > > Would this have some opportunity for newbs

Re: ML PIC

2016-12-21 Thread Nick Pentreath
It is part of the general feature parity roadmap. I can't recall offhand any blocker reasons it's just resources On Wed, 21 Dec 2016 at 17:05, Robert Hamilton wrote: > Hi all. Is it on the roadmap to have an > Spark.ml.clustering.PowerIterationClustering? Are there technical reasons > that there

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely due to hashing tf returning ml vectors while you need mllib vectors for RowMatrix. I'd recommend using the vector conversion utils (I think in mllib.linalg.Vectors but I'm on mobile right now so can't recall exactly). There are util methods for converting single vectors as well as
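The conversion utilities are in fact in org.apache.spark.mllib.util.MLUtils (plus a per-vector helper); a hedged sketch:

    import org.apache.spark.mllib.util.MLUtils

    // whole column: ml.linalg vectors -> mllib.linalg vectors, as RowMatrix expects
    val converted = MLUtils.convertVectorColumnsFromML(df, "features")
    // single vector: org.apache.spark.mllib.linalg.Vectors.fromML(mlVec)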

Re: how to print auc & prc for GBTClassifier, which is okay for RandomForestClassifier

2016-11-28 Thread Nick Pentreath
This is because currently GBTClassifier doesn't extend the ClassificationModel abstract class, which in turn has the rawPredictionCol and related methods for generating that column. I'm actually not sure off hand whether this was because the GBT implementation could not produce the raw prediction

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
sisErrorAt.failAnalysis(package.scala:42) On Mon, Nov 14, 2016 at 1:44 PM, Nick Pentreath wrote: DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. But you may prefer to just use spark.ml evaluators, which work with DataFrames

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. But you may prefer to just use spark.ml evaluators, which work with DataFrames. Try BinaryClassificationEvaluator. On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma wrote: > I am gettin
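A minimal sketch of the DataFrame-based evaluator suggested above (column names are the defaults):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")  // or "areaUnderPR"
    val auc = evaluator.evaluate(predictions)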

Re: Nearest neighbour search

2016-11-14 Thread Nick Pentreath
LSH-based NN search and similarity join should be out in Spark 2.1 - there's a little work being done still to clear up the APIs and some functionality. Check out https://issues.apache.org/jira/browse/SPARK-5992 On Mon, 14 Nov 2016 at 16:12, Kevin Mellott wrote: > You may be able to benefit fro

Re: Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread Nick Pentreath
ata back to CSV. That is why I'm so interested in get_dummies > but it's not scalable enough for my data size (500-600GB per file). > > Thanks in advance. > > Nick > > -- > View this message in context: Finding a Spark Equivalent for Pandas&

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
Oh, also you mention 20 partitions. Is that how many you have? How many ratings? It may be worth trying to repartition to a larger number of partitions. On Fri, 21 Oct 2016 at 17:04, Nick Pentreath wrote: > I wonder if you can try with setting different blocks for user and item? > Are you a

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
default size too. > > On Fri, Oct 21, 2016 at 5:31 AM, Nick Pentreath > wrote: > > Did you try not setting the blocks parameter? It will then try to set it > automatically for your data size. > On Fri, 21 Oct 2016 at 09:16, Nikhil Mishra > wrote: > > I am using 105 no

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
How many nodes are you using in the cluster? On Fri, 21 Oct 2016 at 08:58 Nikhil Mishra wrote: > Thanks Nick. > > So we do partition U x I matrix into BxB matrices, each of size around U/B > and I/B. Is that correct? Do you know whether a single block of the matrix > is repres

Re: [Spark ML] Using GBTClassifier in OneVsRest

2016-10-20 Thread Nick Pentreath
Currently no - GBT implements the predictors, not the classifier interface. It might be possible to wrap it in a wrapper that extends the Classifier trait. Hopefully GBT will support multi-class at some point. But you can use RandomForest which does support multi-class. On Fri, 21 Oct 2016 at 02:

Re: ALS.trainImplicit block sizes

2016-10-20 Thread Nick Pentreath
The blocks params will set both user and item blocks. Spark 2.0 supports user and item blocks for PySpark: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation On Fri, 21 Oct 2016 at 08:12 Nikhil Mishra wrote: > Hi, > > I have a question about the bloc

Re: Making more features in Logistic Regression

2016-10-18 Thread Nick Pentreath
You can use the PolynomialExpansion in Spark ML ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion ) On Tue, 18 Oct 2016 at 21:47 miro wrote: > Yes, I was thinking going down this road: > > > http://scikit-learn.org/stable/modules/linear_mo
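A minimal PolynomialExpansion sketch (degree and column names illustrative):

    import org.apache.spark.ml.feature.PolynomialExpansion

    val poly = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("polyFeatures")
      .setDegree(2)  // expands features into degree-2 polynomial combinations
    val expanded = poly.transform(df)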

textFileStream dStream to DataFrame issues

2016-10-11 Thread Nick
I have a question about how to use textFileStream; I have included a code snippet below. I am trying to read .gz files that are getting put into my bucket. I do not want to specify the schema; I have a similar feature that just does spark.read.json(inputBucket). This works great and if I can get t

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-11 Thread Nick Pentreath
on feature dimension (though the final shuffle size seems 50% of yours with 2x the features - this is with Spark 2.0.1, I haven't tested out master yet with this data). [image: Screen Shot 2016-10-11 at 12.03.55 PM.png] On Fri, 7 Oct 2016 at 08:11 DB Tsai wrote: > Hi Nick, > > >

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-06 Thread Nick Pentreath
I'm currently working on various performance tests for large, sparse feature spaces. For the Criteo DAC data - 45.8 million rows, 34.3 million features (categorical, extremely sparse), the time per iteration for ml.LogisticRegression is about 20-30s. This is with 4x worker nodes, 48 cores & 120GB

Re: why spark ml package doesn't contain svm algorithm

2016-09-27 Thread Nick Pentreath
There is a JIRA and PR for it - https://issues.apache.org/jira/browse/SPARK-14709 On Tue, 27 Sep 2016 at 09:10 hxw黄祥为 wrote: > I have found spark ml package have implement naivebayes algorithm and the > source code is simple,. > > I am confusing why spark ml package doesn’t contain svm algorithm

Re: Spark MLlib ALS algorithm

2016-09-23 Thread Nick Pentreath
The scale factor was only to scale up the number of ratings in the dataset for performance testing purposes, to illustrate the scalability of Spark ALS. It is not something you would normally do on your training dataset. On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote wrote: > Hello, > > I was wor

Re: Similar Items

2016-09-21 Thread Nick Pentreath
Sorry, the original repo: https://github.com/karlhigley/spark-neighbors On Wed, 21 Sep 2016 at 13:09 Nick Pentreath wrote: > I should also point out another library I had not come across before : > https://github.com/sethah/spark-neighbors > > > On Tue, 20 Sep 2016 at 21:0

Re: Similar Items

2016-09-21 Thread Nick Pentreath
ch for the help! > > On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott > wrote: > >> Thanks Nick - those examples will help a ton!! >> >> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath < >> nick.pentre...@gmail.com> wrote: >> >>> A few opt

Re: Similar Items

2016-09-20 Thread Nick Pentreath
/mrsqueeze/spark-hash <https://github.com/mrsqueeze/spark-hash> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott wrote: > Thanks for the reply, Nick! I'm typically analyzing around 30-50K products > at a time (as an isolated set of products). Within this set of products > (which

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search or other learning-to-rank use cases, so it should be a little more generic than pure recommender settings.

Re: Similar Items

2016-09-19 Thread Nick Pentreath
How many products do you have? How large are your vectors? It could be that SVD / LSA could be helpful. But if you have many products then trying to compute all-pair similarity with brute force is not going to be scalable. In this case you may want to investigate hashing (LSH) techniques. On Mon

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-19 Thread Nick Pentreath
Try als.setCheckpointInterval ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS@setCheckpointInterval(checkpointInterval:Int):ALS.this.type ) On Mon, 19 Sep 2016 at 20:01 Roshani Nagmote wrote: > Hello Sean, > > Can you please tell me how to set
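A sketch of the suggested setting on the mllib ALS (checkpointing also needs a checkpoint directory; values illustrative):

    import org.apache.spark.mllib.recommendation.ALS

    sc.setCheckpointDir("/tmp/spark-checkpoints")  // required for checkpointing to take effect
    val als = new ALS()
      .setRank(10)
      .setIterations(20)
      .setCheckpointInterval(5)  // checkpoint factor RDDs every 5 iterations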

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-18 Thread Nick Pentreath
The PR already exists for adding RankingEvaluator to ML - https://github.com/apache/spark/pull/12461. I need to revive and review it. DB, your review would be welcome too (and also on https://github.com/apache/spark/issues/12574 which has implications for the semantics of ranking metrics in the Dat

Re: weightCol doesn't seem to be handled properly in PySpark

2016-09-12 Thread Nick Pentreath
Could you create a JIRA ticket for it? https://issues.apache.org/jira/browse/SPARK On Thu, 8 Sep 2016 at 07:50 evanzamir wrote: > When I am trying to use LinearRegression, it seems that unless there is a > column specified with weights, it will raise a py4j error. Seems odd > because > supposed

Re: How to convert an ArrayType to DenseVector within DataFrame?

2016-09-08 Thread Nick Pentreath
You can use a udf like this: (PySpark shell banner, Spark version 2.0.0) Using Python version 2.7.12 (default, Jul 2 2016 17:43:17) SparkSession available as 'spark'. In [1]: from pyspark.m
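The excerpt is cut off, but a hedged Scala equivalent of such a udf (column names hypothetical) would be:

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.functions.{col, udf}

    // wrap an array<double> column into ml DenseVectors
    val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
    val withVec = df.withColumn("features", toVector(col("arrayCol")))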

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Nick Pentreath
That does seem strange. Can you provide an example to reproduce? On Tue, 6 Sep 2016 at 21:49 evanzamir wrote: > Am I misinterpreting what r2() in the LinearRegression Model summary means? > By definition, R^2 should never be a negative number! > > > > -- > View this message in context: > http:

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Nick Pentreath
16 at 15:37 Nick Pentreath wrote: > Right now you are correct that Spark ML APIs do not support predicting on > a single instance (whether Vector for the models or a Row for a pipeline). > > See https://issues.apache.org/jira/browse/SPARK-10413 and > https://issues.apache.org/jir

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Nick Pentreath
Right now you are correct that Spark ML APIs do not support predicting on a single instance (whether Vector for the models or a Row for a pipeline). See https://issues.apache.org/jira/browse/SPARK-10413 and https://issues.apache.org/jira/browse/SPARK-16431 (duplicate) for some discussion. There m

Re: Equivalent of "predict" function from LogisticRegressionWithLBFGS in OneVsRest with LogisticRegression classifier (Spark 2.0)

2016-08-29 Thread Nick Pentreath
Try this: val df = spark.createDataFrame(Seq(Vectors.dense(Array(10.0, 590.0, 190.0, 700.0))).map(Tuple1.apply)).toDF("features") On Sun, 28 Aug 2016 at 11:06 yaroslav wrote: > Hi, > > We use such kind of logic for training our model > > val model = new LogisticRegressionWithLBFGS() > .setNum

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
o Array and >> it will work. I was wondering if I could do this in Spark/Scala with my >> limited knowledge >> >> Cheers >> >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
what is "text"? i.e. what is the "val text = ..." definition? If text is a String itself then indeed sc.parallelize(Array(text)) is doing the correct thing in this case. On Tue, 23 Aug 2016 at 19:42 Mich Talebzadeh wrote: > I am sure someone know this :) > > Created a dynamic text string which

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread Nick Pentreath
I believe it may be because of this issue ( https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an estimator - hence in cases where the number of categories differ between train and test, it's not usable in the current form. It's tricky to work around, though one option is to use featur

Re: Model Persistence

2016-08-18 Thread Nick Pentreath
Model metadata (mostly parameter values) are usually tiny. The parquet data is most often for model coefficients. So this depends on the size of your model, i.e. Your feature dimension. A high-dimensional linear model can be quite large - but still typically easy to fit into main memory on a singl

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Nick Pentreath
/docs/latest/ml-features.html#tf-idf). On Thu, 11 Aug 2016 at 22:14 Ben Teeuwen wrote: > Thanks Nick, I played around with the hashing trick. When I set > numFeatures to the amount of distinct values for the largest sparse > feature, I ended up with half of them colliding. When ra

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
but not quite enabled an > optimization. > > > On Wed, Aug 10, 2016, 18:10 Nick Pentreath > wrote: > >> Sean by 'offset' do you mean basically subtracting the mean but only from >> the non-zero elements in each row? >> On Wed, 10 Aug 2016 at 19:02, S

Re: Spark2 SBT Assembly

2016-08-10 Thread Nick Pentreath
You're correct - Spark packaging has been shifted to not use the assembly jar. To build now use "build/sbt package" On Wed, 10 Aug 2016 at 19:40, Efe Selcuk wrote: > Hi Spark folks, > > With Spark 1.6 the 'assembly' target for sbt would build a fat jar with > all of the main Spark dependencies
