Re: Ability to have CountVectorizerModel vocab as empty

2020-08-19 Thread Jatin Puri
Thanks Sean for the quick response. Logged a Jira: https://issues.apache.org/jira/browse/SPARK-32662 Will send a pull request shortly. Regards, Jatin On Wed, Aug 19, 2020 at 6:58 PM Sean Owen wrote: > I think that's true. You're welcome to open a pull request / JIRA to > remove that

Re: RDD which was checkpointed is not checkpointed

2020-08-19 Thread Ivan Petrov
Awesome, thanks for explaining it. On Wed, Aug 19, 2020 at 16:29, Russell Spitzer wrote: > It determines whether it can use the checkpoint at runtime, so you'll be > able to see it in the UI but not in the plan since you are looking at the > plan > before the job is actually running when it checks to

Re: RDD which was checkpointed is not checkpointed

2020-08-19 Thread Russell Spitzer
It determines whether it can use the checkpoint at runtime, so you'll be able to see it in the UI but not in the plan since you are looking at the plan before the job is actually running when it checks to see if it can use the checkpoint in the lineage. Here is a two stage job for example:

Re: RDD which was checkpointed is not checkpointed

2020-08-19 Thread Ivan Petrov
I did it and looked at the lineage BEFORE calling an action. No success. Job$ - isCheckpointed? false, getCheckpointFile: None Job$ - recordsRDD.toDebugString: (2) MapPartitionsRDD[7] at map at Job.scala:112 [] | MapPartitionsRDD[6] at map at Job.scala:111 [] | MapPartitionsRDD[5] at map at

Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-19 Thread Michel Sumbul
Hi Prashant, I have the problem only on K8S; it works fine when Spark runs on top of YARN. I'm wondering whether the delegation token gets saved. Any idea how to check that? Could it be because KMS is in HA and Spark requests 2 delegation tokens? For the testing, just running spark3 on top of

Re: Ability to have CountVectorizerModel vocab as empty

2020-08-19 Thread Sean Owen
I think that's true. You're welcome to open a pull request / JIRA to remove that requirement. On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri wrote: > > Hello, > > This is wrt > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244 >

Re: RDD which was checkpointed is not checkpointed

2020-08-19 Thread Jacob Lynn
Hi Ivan, Unlike cache/persist, checkpoint does not operate in-place but requires the result to be assigned to a new variable. In your case: val recordsRDD = convertToRecords(anotherRDD).checkpoint() Best, Jacob On Wed, Aug 19, 2020 at 14:39, Ivan Petrov wrote: > Hi! > Seems like I'm doing something
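For readers comparing the suggestion above with the RDD API: as far as I know, RDD.checkpoint() returns Unit and only marks the RDD, with the data written when the first action runs on it, whereas it is Dataset.checkpoint() that returns a new Dataset to assign. The following is a minimal sketch of the RDD case, not code from this thread. It assumes a local Spark session, and the object name CheckpointSketch, the run() helper, the sample data, and the temporary checkpoint directory are all hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  // Returns (isCheckpointed before the action, isCheckpointed after the action).
  def run(): (Boolean, Boolean) = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: local mode, for illustration only
      .appName("checkpoint-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Hypothetical checkpoint location: a fresh temporary directory.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

      // Stand-in for convertToRecords(anotherRDD) from the question.
      val recordsRDD = sc.parallelize(1 to 100, 2).map(_ * 2)

      recordsRDD.checkpoint()                  // returns Unit; only marks the RDD
      val before = recordsRDD.isCheckpointed   // no job has run on the RDD yet

      recordsRDD.count()                       // first action; checkpoint is written after the job
      val after = recordsRDD.isCheckpointed

      (before, after)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"isCheckpointed before action: $before, after action: $after")
  }
}
```

Because the RDD is recomputed when the checkpoint is written, calling cache() or persist() before checkpoint() is commonly recommended to avoid doing the work twice.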

RDD which was checkpointed is not checkpointed

2020-08-19 Thread Ivan Petrov
Hi! Seems like I'm doing something wrong. I call .checkpoint() on an RDD, but it's not checkpointed... What am I doing wrong? val recordsRDD = convertToRecords(anotherRDD) recordsRDD.checkpoint() logger.info("checkpoint done") logger.info(s"isCheckpointed? ${recordsRDD.isCheckpointed}, getCheckpointFile:

Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-19 Thread Prashant Sharma
-dev Hi, I have used Spark with HDFS encrypted with Hadoop KMS, and it worked well. However, I cannot recall whether Kubernetes was in the mix. From the error alone, it is not clear what caused the failure. Is there a way I can reproduce this? Thanks, On Sat, Aug 15, 2020 at 7:18 PM Michel

Ability to have CountVectorizerModel vocab as empty

2020-08-19 Thread Jatin Puri
Hello, This is wrt https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244 require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.") Currently, if `CountVectorizer` is trained on an empty dataset

Re: About how to read spark source code with a good way [Marketing Mail]

2020-08-19 Thread Jack Kolokasis
Hi Joyan, check this link: https://github.com/jackkolokasis/SparkInternals Thanks Iacovos On 19/8/20 9:09 AM, joyan sil wrote: Hi Jack and Spark experts, Further to the question asked in this thread, what are some recommended resources (blog/videos) that have helped you to deep dive into

Re: About how to read spark source code with a good way [Marketing Mail]

2020-08-19 Thread joyan sil
Hi Jack and Spark experts, Further to the question asked in this thread, what are some recommended resources (blogs/videos) that have helped you deep dive into the Spark source code? Thanks, Regards, Joyan On Wed, Aug 19, 2020 at 11:06 AM Jack Kolokasis wrote: > Hi, > > From my experience, I