Re: Impact of .localCheckpoint() and executor dying

2021-01-06 Thread Jacek Laskowski
ble (as it's on a stable HDFS file system not on an ephemeral executor). In either case, the lineage should be the same = cut. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on http

Re: Impact of .localCheckpoint() and executor dying

2021-01-06 Thread Jacek Laskowski
ay back. I wish myself that someone with more skills in this area chimed in... Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jacekl

Re: Converting spark batch to spark streaming

2021-01-08 Thread Jacek Laskowski
Hi, Start with DataStreamWriter.foreachBatch. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Jan 7, 2
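
The suggestion above could be sketched roughly as follows. This is only a minimal illustration, assuming Spark 2.4 or later (where DataStreamWriter.foreachBatch is available); the "rate" source, the output path and the writeBatch value are placeholders that do not come from the thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchSketch {
  // Hypothetical batch logic, typed explicitly as a Scala function so the
  // Scala overload of foreachBatch is selected unambiguously.
  val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
    // Any existing batch code can go here; the path is an assumption.
    batch.write.mode("append").parquet("/tmp/foreach-batch-demo")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foreachBatch-sketch")
      .master("local[*]")
      .getOrCreate()

    // The built-in "rate" source stands in for the real streaming input.
    val input = spark.readStream.format("rate").load()

    // Each micro-batch is handed to writeBatch as a regular DataFrame.
    val query = input.writeStream
      .foreachBatch(writeBatch)
      .start()

    query.awaitTermination()
  }
}
```

The point of the sketch is that the body of writeBatch sees an ordinary DataFrame, so code written for a batch job can often be reused with few changes.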

Re: Data source v2 streaming sinks does not support Update mode

2021-01-12 Thread Jacek Laskowski
Hi, Can you post the whole message? I'm trying to find what might be causing it. A small reproducible example would be of help too. Thank you. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/>

Re: Insertable records in Datasource v2.

2021-01-14 Thread Jacek Laskowski
a.schema.names: _*) .write .insertInto(sqlView) In summary, you should report this to JIRA, but don't expect this to get fixed other than to catch this case just to throw this exception from ResolveRelations: "Inserting into a view is not allowed". Unless I'm mistaken... Pozd

Re: Dynamic Spark metrics creation

2021-01-17 Thread Jacek Laskowski
Hey Yurii, > which is unavailable from executors. Register it on the driver and use accumulators on executors to update the values (on the driver)? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/>

Re: understanding spark shuffle file re-use better

2021-01-17 Thread Jacek Laskowski
n of a query is used to look up any cached queries. Again, I'm not really sure and if I'd have to answer it (e.g. as part of an interview) I'd say nothing would be shared / re-used. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Bo

Re: Spark Event Log Forwarding and Offset Tracking

2021-01-17 Thread Jacek Laskowski
ed and also forward that to ElasticSearch via log4j for monitoring Think SparkListener API would help here too. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklask

Re: Spark RDD + HBase: adoption trend

2021-01-20 Thread Jacek Laskowski
pment IMHO). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jan 20, 2021 at 2:44 PM Marco Firrincieli wrote:

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Jacek Laskowski
Hi Marco, A Scala dev here. In short: yet another reason against Python :) Honestly, I've got no idea why the code gives the output. Ran it with 3.1.1-rc1 and got the very same results. Hoping pyspark/python devs will chime in and shed more light on this. Pozdrawiam, Jacek Laskowski

Re: Query on entrypoint.sh Kubernetes spark

2021-01-21 Thread Jacek Laskowski
life as a container of a driver pod. There's no point using cluster deploy mode...ever. Makes sense? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https:

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Jacek Laskowski
Hi, I'd look at stages and jobs as it's possible that the only task running is the missing one in a stage of a job. Just guessing... Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow

Re: Application Timeout

2021-01-21 Thread Jacek Laskowski
Hi Brett, No idea why it happens, but got curious about this "Cores" column being 0. Is this always the case? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.

Re: Process each kafka record for structured streaming

2021-01-21 Thread Jacek Laskowski
Hi, Can you use console sink and make sure that the pipeline shows some progress? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/j

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Jacek Laskowski
Hi, Never heard of it (and have once been tasked to explore a similar use case). I'm curious how you'd like it to work? (no idea how Hive does this either) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.j

Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-22 Thread Jacek Laskowski
Hi Filip, Care to share the code behind "The only thing I found so far involves using forEachBatch and manually updating my aggregates. "? I'm not completely sure I understand your use case and hope the code could shed more light on it. Thank you. Pozdrawiam, Jacek Lasko

Re: Spark Kubernetes 3.0.1 | podcreationTimeout not working

2021-02-10 Thread Jacek Laskowski
case they're not deleted as they simply wait forever. I might be mistaken here though. What property is this for "this timeout of 60 sec."? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/>

Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Jacek Laskowski
Hi, > as Executors terminates after their work completes. --conf spark.kubernetes.executor.deleteOnTermination=false ? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter

Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Jacek Laskowski
Hi, On GCP I'd go for buckets in Google Storage. Not sure how reliable it is in production deployments though. Only demo experience here. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me o

[k8s] PersistentVolumeClaim support in 3.1.1 on minikube

2021-03-15 Thread Jacek Laskowski
wski.github.io/spark-kubernetes-book/demo/persistentvolumeclaims/ Please help. Thank you! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>

Re: [k8s] PersistentVolumeClaim support in 3.1.1 on minikube

2021-03-15 Thread Jacek Laskowski
Hi, I think I found it. I should be using OnDemand claim name so it gets replaced to be unique per executor (?) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jacekl

Source.getBatch and schema vs qe.analyzed.schema?

2021-03-29 Thread Jacek Laskowski
a3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L61 [2] https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L35 Pozdrawiam, Jacek Laskowski h

Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-03 Thread Jacek Laskowski
many HTTP calls are there under the covers? How to know it for GCS? Thank you for any help you can provide. Merci beaucoup mes amis :) [1] https://stackoverflow.com/q/66933229/1305344 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <

Re: Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-03 Thread Jacek Laskowski
"safe" and "safety" meanings. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Apr 3,

Re: Source.getBatch and schema vs qe.analyzed.schema?

2021-04-03 Thread Jacek Laskowski
Hi Bartosz, This is not a question about whether the data source supports fixed or user-defined schema but what schema to use when requested for a streaming batch in Source.getBatch. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Bo

Re: Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-04 Thread Jacek Laskowski
Hi Vaquar, Thanks a lot! Accepted as the answer (yet there was the other answer that was very helpful too). Tons of reading ahead to understand it more. That once again makes me feel that Hadoop MapReduce experience would help a great deal (and I've got none). Pozdrawiam, Jacek Lask

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Jacek Laskowski
hat's what happens in Kafka Streams too Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sun, Apr 4, 2021

Re: Updating spark-env.sh per application

2021-05-09 Thread Jacek Laskowski
Hi, The easiest (but perhaps not necessarily the most flexible) is simply to use two different versions of spark-submit script with the env var set to two different values. Have you tried it yet? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" On

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-02 Thread Jacek Laskowski
Big shout-out to you, Dongjoon! Thank you. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jun 2, 20

Re: Java : Testing RDD aggregateByKey

2021-08-19 Thread Jacek Laskowski
Hi Pedro, No idea what might be causing it. Do you perhaps have some code to reproduce it locally? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <

Is memory-only no-disk Spark possible?

2021-08-20 Thread Jacek Laskowski
k to avoid OOMEs). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>

Re: Is memory-only no-disk Spark possible?

2021-08-21 Thread Jacek Laskowski
Hi Bobby, What a great summary of what happens behind the scenes! Enjoyed every sentence! "The default shuffle implementation will always write out to disk." <-- that's what I wasn't sure about the most. Thanks again! /me On digging deeper... Pozdrawiam, Jacek Laskowsk

Re: Java : Testing RDD aggregateByKey

2021-08-21 Thread Jacek Laskowski
w are the above different from yours? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Aug 19, 2021 at 5:

Re: Connection reset by peer : failed to remove cache rdd

2021-08-30 Thread Jacek Laskowski
rrors coming from broadcast joins perhaps? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Mon, Aug 30, 2021 at

Re: Spark Stream on Kubernetes Cannot Set up JavaSparkContext

2021-08-31 Thread Jacek Laskowski
a thought but wanted to share as I think it's worth investigating. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklasko

Re: Spark Stream on Kubernetes Cannot Set up JavaSparkContext

2021-09-05 Thread Jacek Laskowski
part of Spark? You should not really be doing such risky config changes (unless you've got no other choice and you know what you're doing). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/>

Re: query time comparison to several SQL engines

2022-04-07 Thread Jacek Laskowski
g and am really curious (not implying that one is better or worse than the other(s)). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jacekl

Re: Prometheus with spark

2022-10-25 Thread Jacek Laskowski
Hi Raj, Do you want to do the following? spark.read.format("prometheus").load... I haven't heard of such a data source / format before. What would you like it for? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Jacek Laskowski
Yoohoo! Thanks Yuming for driving this release. A tiny step for Spark a huge one for my clients (who still are on 3.2.1 or even older :)) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on htt

Re: spark.catalog.listFunctions type signatures

2023-03-28 Thread Jacek Laskowski
/github.com/apache/spark/blob/e60ce3e85081ca8bb247aeceb2681faf6a59a056/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L91 Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-12 Thread Jacek Laskowski
Hi, You could use QueryExecutionListener or Spark listeners to intercept query execution events and extract whatever is required. That's what web UI does (as it's simply a bunch of SparkListeners --> https://youtu.be/mVP9sZ6K__Y ;-)). Pozdrawiam, Jacek Laskowski "The In

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Jacek Laskowski
reenshots won't give you that level of detail. You'd have to intercept execution events and correlate them. Not an easy task yet doable. HTH. Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklas

Re: How to create spark udf use functioncatalog?

2023-04-14 Thread Jacek Laskowski
d and used properly using the custom catalog impl. HTH Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Fri, Apr 14, 2023 at 2:10 PM 许新浩 <948

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-14 Thread Jacek Laskowski
l/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L60 [4] https://github.com/apache/spark/blob/c124037b97538b2656d29ce547b2a42209a41703/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLTab.scala#L24 Pozdrawiam, Jacek Laskowski "The Internals Of"

Re: Error while merge in delta table

2023-05-11 Thread Jacek Laskowski
Hi Karthick, Sorry to say it but there's not enough "data" to help you. There should be something more above or below this exception snippet you posted that could pinpoint the root cause. Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://bo

Re: MLLib + Streaming

2016-03-06 Thread Jacek Laskowski
Hi, Thanks a lot, Guru Medasani, for such an excellent theory-rich intro to MLlib! I wish I found only such emails in my mailbox. Sorry. Couldn't resist since I've just started with MLlib and your response has resonated so well with my initial experience. Thanks! Jacek 06.03.2016 6:55 AM "Guru

Re: Add the sql record having same field.

2016-03-06 Thread Jacek Laskowski
What about sum? Jacek 06.03.2016 7:28 AM "Angel Angel" wrote: > Hello, > I have one table and 2 fields in it > 1) item_id and > 2) count > > > > i want to add the count field as per item (means group the item_ids) > > example > Input > itea_ID Count > 500 2 > 200 6 > 500 4 > 100 3 > 200 6 >

Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen, I've spent a few hours on the changes related to streaming DataFrames (included in SPARK-8360) and concluded that it's currently only possible to use read.stream(), but not write.stream(), since there are no streaming Sinks yet. Pozdrawiam, Jacek Laskowski https://

Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen, I don't really know. I think TD or Michael should know as they are personally involved in the task (as far as I could figure out from the JIRA and the changes). Ping people on the JIRA so they notice your question(s). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklask

Re: Get output of the ALS algorithm.

2016-03-11 Thread Jacek Laskowski
What about write.save(file)? P.S. I'm new to Spark MLlib. 11.03.2016 4:57 AM "Shishir Anshuman" wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS algorithm in a file. > Please suggest me the solution. > > Thank you > > >

Re: How can I join two DataSet of same case class?

2016-03-11 Thread Jacek Laskowski
Hi, Use the names of the datasets, not $, i.e. a("edid"). Jacek 11.03.2016 6:09 AM "박주형" wrote: > Hi. I want to join two DataSet. but below stderr is shown > > 16/03/11 13:55:51 WARN ColumnName: Constructing trivially true equals > predicate, ''edid = 'edid'. Perhaps you need to use aliase
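
A minimal sketch of the advice above, assuming Spark 2.x or later (SparkSession) rather than the 1.6 release the thread was likely about. The Record case class and the sample rows are hypothetical; only the edid column name comes from the thread.

```scala
import org.apache.spark.sql.SparkSession

object JoinByDatasetColumn {
  // Hypothetical case class shared by both datasets.
  final case class Record(edid: Long, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val a = Seq(Record(1L, "a1"), Record(2L, "a2")).toDS()
    val b = Seq(Record(1L, "b1"), Record(3L, "b3")).toDS()

    // $"edid" === $"edid" would compare a column with itself, which is the
    // "trivially true equals predicate" from the warning. a("edid") and
    // b("edid") are resolved against their own datasets instead.
    val joined = a.join(b, a("edid") === b("edid"))
    joined.show()

    spark.stop()
  }
}
```

Aliasing the two sides (as the warning itself hints) is another option when both datasets derive from the same source.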

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Jacek Laskowski
Hi, How do you check which executor is used? Can you include a screenshot of the master's web UI with workers? Jacek 11.03.2016 6:57 PM "Darin McBeath" wrote: > I've run into a situation where it would appear that foreachPartition is > only running on one of my executors. > > I have a small

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Jacek Laskowski
e other > executor). The executor that was used in the foreachPartition call works > fine and doesn't experience issue. But, because the other executor is > failing on every request the job dies. > > Darin. > > > > From: Jacek Laskowski > T

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Jacek Laskowski
Hi, Why do you use Maven, not sbt, for Scala? Can you show the entire pom.xml and the command to execute the app? Jacek 11.03.2016 7:33 PM "vasu20" wrote: > Hi > > Any help appreciated on this. I am trying to write a Spark program using > IntelliJ. I get a run time error as soon as new Sp

Re: udf StructField to JSON String

2016-03-11 Thread Jacek Laskowski
Hi Tristan, Mind sharing the relevant code? I'd like to learn the way you use Transformer to do so. Thanks! Jacek 11.03.2016 7:07 PM "Tristan Nixon" wrote: > I have a similar situation in an app of mine. I implemented a custom ML > Transformer that wraps the Jackson ObjectMapper - this giv

Re: adding rows to a DataFrame

2016-03-11 Thread Jacek Laskowski
Just a guess...flatMap? Jacek 11.03.2016 7:46 PM "Stefan Panayotov" wrote: > Hi, > > I have a problem that requires me to go through the rows in a DataFrame > (or possibly through rows in a JSON file) and conditionally add rows > depending on a value in one of the columns in each existing r

Re: Spark property parameters priority

2016-03-11 Thread Jacek Laskowski
Hi, It could also be conf/spark-defaults.conf. Jacek 11.03.2016 8:07 PM "Cesar Flores" wrote: > > Right now I know of three different things to pass property parameters to > the Spark Context. They are: > >- A) Inside a SparkConf object just before creating the Spark Context >- B) D

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Jacek Laskowski
> shade > > > > > > ${project.artifactId}-${project.version}-with-dependencies > > > > > >

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-11 Thread Jacek Laskowski
Hi, For jars use spark-submit --jars. Dunno about .so files. Could that work through jars? Jacek 11.03.2016 8:07 PM "prateek arora" wrote: > Hi > > I have multiple node cluster and my spark jobs depend on a native > library (.so files) and some jar files. > > Can some one please explain what ar

Re: help coercing types

2016-03-19 Thread Jacek Laskowski
Hi, Just a side question: why do you convert DataFrame to RDD? It's like driving backwards (possible but ineffective and dangerous at times). P.S. I'd even go for Dataset. Jacek 18.03.2016 5:20 PM "Bauer, Robert" wrote: > I have data that I pull in using a sql context and then I convert t

Re: Converting array of string type to datetime

2016-03-23 Thread Jacek Laskowski
Hi, Why don't you use Datasets? You'd cut the number of getStrings and it'd read nicer to your eyes. Also, doing such transformations would *likely* be easier. p.s. Please gist your example to fix it. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Masteri

Re: Best way to determine # of workers

2016-03-25 Thread Jacek Laskowski
Hi, You may want to use SparkListener [1] (as the web UI does) and listen for SparkListenerExecutorAdded and SparkListenerExecutorRemoved events. [1] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener Pozdrawiam, Jacek Laskowski https://medium.com
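
One way the listener idea above might look in code. This is a hedged sketch: ExecutorTracker and ExecutorTrackerDemo are hypothetical names, and the master URL is assumed to be supplied by spark-submit.

```scala
import java.util.concurrent.ConcurrentHashMap

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Hypothetical listener that keeps the ids of the executors currently registered.
class ExecutorTracker extends SparkListener {
  private val executors = ConcurrentHashMap.newKeySet[String]()

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    executors.add(event.executorId)
    ()
  }

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
    executors.remove(event.executorId)
    ()
  }

  // Number of executors seen as added and not yet removed.
  def count: Int = executors.size
}

object ExecutorTrackerDemo {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("executor-tracker"))
    val tracker = new ExecutorTracker
    sc.addSparkListener(tracker)

    // Run a small job so the application has something to do.
    sc.parallelize(1 to 100).count()
    println(s"Executors tracked so far: ${tracker.count}")

    sc.stop()
  }
}
```

A listener registered with addSparkListener only sees events posted after registration, so executors that came up earlier may be missed; setting the spark.extraListeners property is one way to register the listener at startup.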

[SQL] Two columns in output vs one when joining DataFrames?

2016-03-25 Thread Jacek Laskowski
eft.join(right, Seq("_1")).show +---+---+---+ | _1| _2| _2| +---+---+---+ | 1| a| a| | 2| b| b| +---+---+---+ scala> left.join(right, left("_1") === right("_1")).show +---+---+---+---+ | _1| _2| _1| _2| +---+---+---+---+ | 1| a| 1| a| | 2| b| 2|

Re: Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-28 Thread Jacek Laskowski
Hi, How do you run the pipeline? Do you assembly or package? Is this on local or spark or other cluster manager? What's the build configuration? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at

Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Jacek Laskowski
error: missing parameter type for expanded function ((x$1) => x$1.id) ds.select(_.id).show ^ Is this supposed to work in Spark 2.0 (today's build)? BTW, Why is Seq(Text(0, "hello"), Text(1, "world")).as[Text] not possible? Pozdrawiam, Jacek Laskowsk

[SQL] A bug with withColumn?

2016-03-31 Thread Jacek Laskowski
5| | 1|swiecie| 1|three|5| +---+---+---+-+-+ scala> df.join(nums, df("id") === nums("id")).withColumn("TEXT", lit(5)).show +---++---++ | id|TEXT| id|TEXT| +---++---++ | 0| 5| 0| 5| | 1| 5| 1| 5| +---++---++ Pozdrawiam, Jacek L

Re: [SQL] A bug with withColumn?

2016-03-31 Thread Jacek Laskowski
Hi, Thanks Ted. It means that it's not only possible to rename a column using withColumnRenamed, but also replace the content of a column (in one shot) using withColumn with an existing column name. I can live with that :) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklask
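
The behaviour described above can be illustrated with a small sketch, assuming Spark 2.x or later; the DataFrame and its column names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WithColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("withColumn-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical two-column DataFrame.
    val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")

    // withColumnRenamed changes the name and keeps the values.
    df.withColumnRenamed("text", "label").show()

    // withColumn with an existing name keeps the name and replaces the values.
    df.withColumn("text", lit(5)).show()

    // withColumn with a new name appends a column.
    df.withColumn("five", lit(5)).show()

    spark.stop()
  }
}
```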

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Jacek Laskowski
Hi Ted, Sure! It works with map, but not with select. Wonder if it's by design or...will soon be fixed? Thanks again for your help. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitte

Re: [SQL] A bug with withColumn?

2016-04-01 Thread Jacek Laskowski
On Thu, Mar 31, 2016 at 5:47 PM, Jacek Laskowski wrote: > It means that it's not only possible to rename a column using > withColumnRenamed, but also replace the content of a column (in one > shot) using withColumn with an existing column name. I can live with > that :) Hi, No

Re: Graphframes pattern causing java heap space errors

2016-04-09 Thread Jacek Laskowski
de. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, Apr 9, 2016 at 7:51 PM, Buntu Dev wrote: > I'm running this motif pattern against 1.5M vertice

Re: Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Jacek Laskowski
Hi, Not that I might help much with deployment to Mesos, but can you describe your Mesos/Marathon setup? What's the Mesos cluster dispatcher? Jacek 18.04.2016 12:54 PM "Joao Azevedo" wrote: > Hi! > > I'm trying to submit Spark applications to Mesos using the 'cluster' > deploy mode. I'm using

Re: Spark 2.0 Release Date

2016-04-28 Thread Jacek Laskowski
Hi Arun, My bet is...https://spark-summit.org/2016 :) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Apr 28, 2016 at 1:43 PM, Arun Patel wrote: > A sm

Re: ERROR SparkContext: Error initializing SparkContext.

2016-05-09 Thread Jacek Laskowski
the logs in YARN. Go to localhost:8088/cluster/apps and see the app's logs. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Mon, May 9, 2016 at 9:45 AM, A

Re: No of Spark context per jvm

2016-05-09 Thread Jacek Laskowski
Hi, I'd say "one per classloader". Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Mon, May 9, 2016 at 10:16 AM, praveen S wrote: > Hi, >

Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Jacek Laskowski
ble deploy-mode). Also, deploy-mode client is the default deploy mode so you may safely remove it. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, May 15,

Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Jacek Laskowski
On Sun, May 15, 2016 at 5:50 PM, Richard Siebeling wrote: > I'm getting the following errors running SparkPi on a clean just compiled > and checked Mesos 0.29.0 installation with Spark 1.6.1 > > 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor > e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 o

Re: Executors and Cores

2016-05-15 Thread Jacek Laskowski
On Sun, May 15, 2016 at 8:19 AM, Mail.com wrote: > In all that I have seen, it seems each job has to be given the max resources > allowed in the cluster. Hi, I'm fairly sure it was because FIFO scheduling mode was used. You could change it to FAIR and make some adjustments. https://spark.apac

A bug with RDD Storage Info and links to sort rows?

2016-05-20 Thread Jacek Laskowski
(while Data Distribution table's columns not)? A bug? It at least looks a bit odd. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklask

Re: How spark depends on Guava

2016-05-23 Thread Jacek Laskowski
Hi Todd, It's used heavily for thread pool executors for one. Don't know about other uses. Jacek On 23 May 2016 5:49 a.m., "Todd" wrote: > Hi, > In the spark code, guava maven dependency scope is provided, my question > is, how spark depends on guava during runtime? I looked into the > spark-as

Re: Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Jacek Laskowski
Hi, What happens when you create the parent directory /home/stuti? I think the failure is due to missing parent directories. What's the OS? Jacek On 24 May 2016 11:27 a.m., "Stuti Awasthi" wrote: Hi All, I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2 Slaves. Also Im not h

Re: Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Jacek Laskowski
Hi Mathieu, Thanks a lot for the answer! I did *not* know it's the driver that creates the directory. You said "standalone mode"; is this the case for the other modes (YARN and Mesos) too? P.S. Did you find it in the code or...just experience it before? #curious Pozdrawiam, Jacek Lasko

Re: Accumulators displayed in SparkUI in 1.4.1?

2016-05-25 Thread Jacek Laskowski
On 25 May 2016 6:00 p.m., "Daniel Barclay" wrote: > > Was the feature of displaying accumulators in the Spark UI implemented in Spark 1.4.1, or was that added later? Dunno, but only *named* *accumulators* are displayed in Spark’s webUI (under Stages tab for a given stage). Jacek

Re: Not able to write output to local filsystem from Standalone mode.

2016-05-27 Thread Jacek Laskowski
sense to me". Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, May 27, 2016 at 3:42 AM, Yong Zhang wrote: > That just makes sense, doesn't it? >

Re: Spark Thrift Server run job as hive user

2016-05-31 Thread Jacek Laskowski
Hi, How do you start thrift server? What's your user name? I think it takes the user and always runs as it. Seen proxyUser today in spark-submit that may or may not be useful here. Jacek On 31 May 2016 10:01 a.m., "Radhika Kothari" wrote: Hi Anyone knows about spark thrift server always take h

Re: Spark Thrift Server run job as hive user

2016-05-31 Thread Jacek Laskowski
What's "With the help of UI"? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, May 31, 2016 at 1:02 PM, Radhika Kothari wrote: > Hi, >

--driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-01 Thread Jacek Laskowski
lease confirm (or fix) my understanding before I file a JIRA issue. Thanks! [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L475-L476 Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.

Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Jacek Laskowski
Hi, A few things for closer examination: * Is the yarn master URL accepted in 1.3? I thought it was only in later releases. Since you're seeing the issue it seems it does work. * I've never seen confs specified using a single string. Can you check in the web UI that they're applied? * What about this in

Re: how to increase threads per executor

2016-06-03 Thread Jacek Laskowski
--executor-cores 1 to be exact. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Jun 3, 2016 at 12:28 AM, Mich Talebzadeh wrote: > interesting. a vm with

Re: Akka with Hadoop/Spark

2016-06-05 Thread Jacek Laskowski
Hi, "I am supposed to work with akka and Hadoop in building apps on top of the data available in hadoop" <-- that's outside the topics covered in this mailing list (unless you're going to use Spark, too). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Jacek Laskowski
On Sun, Jun 5, 2016 at 9:01 PM, Ashok Kumar wrote: > Now I have added this > > libraryDependencies += "com.databricks" % "apps.twitter_classifier" > > However, I am getting an error > > > error: No implicit for Append.Value[Seq[sbt.ModuleID], > sbt.impl.GroupArtifactID] found, > so sbt.impl.Gr

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Jacek Laskowski
Hi, What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's the topic name? Jacek On 7 Jun 2016 11:06 a.m., "Dominik Safaric" wrote: > As I am trying to integrate Kafka into Spark, the following exception > occurs: > > org.apache.spark.SparkException: java.nio.channels.Closed

Re: Specify node where driver should run

2016-06-07 Thread Jacek Laskowski
Hi, It's not possible. YARN uses CPU and memory for resource constraints and places the AM on any node available. The same goes for executors (unless data locality constrains the placement). Jacek On 6 Jun 2016 1:54 a.m., "Saiph Kappa" wrote: > Hi, > > In yarn-cluster mode, is there any way to specify on

Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
On Tue, Jun 7, 2016 at 1:25 PM, Arun Patel wrote: > Do we have any further updates on release date? Nope :( And it's even quieter than I could have thought. I was so certain that today's the date. Looks like Spark Summit has "consumed" all the people behind 2.0... Can't believe no one (from the

Re: Specify node where driver should run

2016-06-07 Thread Jacek Laskowski
Hi, --master yarn-client is deprecated and you should use --master yarn --deploy-mode client instead. There are two deploy-modes: client (default) and cluster. See http://spark.apache.org/docs/latest/cluster-overview.html. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski
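The replacement for the deprecated --master yarn-client form can be sketched as a spark-submit invocation. This is only an illustration: the class name com.example.Main and the jar app.jar are hypothetical placeholders, and the command is printed rather than executed so the sketch does not need a Spark installation.

```shell
#!/bin/sh
# Deprecated form, per the message above: --master yarn-client
# Preferred form: pass the master and the deploy mode separately.
MASTER="yarn"
DEPLOY_MODE="client"   # the other deploy mode is "cluster"

# com.example.Main and app.jar are hypothetical placeholders.
CMD="spark-submit --master $MASTER --deploy-mode $DEPLOY_MODE --class com.example.Main app.jar"

# Print the command instead of running it.
echo "$CMD"
```

Switching DEPLOY_MODE to "cluster" gives the second of the two deploy modes mentioned in the message.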

Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
Finally, the PMC voice on the subject. Thanks a lot, Sean! p.s. Given how much time it takes to ship 2.0 (with so many cool features already baked in!) I'd vote for releasing a few more RCs before 2.0 hits the shelves. I hope 2.0 is not Java 9 or Jigsaw ;-) Pozdrawiam, Jacek Lask

Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen wrote: > That's not any kind of authoritative statement, just my opinion and guess. Oh, come on. You're not **a** Sean but **the** Sean (= a PMC member and the JIRA/PRs keeper) so what you say **is** kinda official. Sorry. But don't worry the PMC (the gro

Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
for web UI that the console knows what happens under the covers (and can calculate the stats). BTW, spark.ui.port (default: 4040) controls the port the web UI binds to. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow m
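As a small sketch of the spark.ui.port setting mentioned above, the property can be passed to spark-submit with --conf. The port value 4041, the class name com.example.Main, and the jar app.jar are assumptions chosen for illustration; the command is printed rather than executed.

```shell
#!/bin/sh
# spark.ui.port defaults to 4040; 4041 here is an arbitrary example value.
UI_PORT=4041

# com.example.Main and app.jar are hypothetical placeholders.
CMD="spark-submit --conf spark.ui.port=$UI_PORT --class com.example.Main app.jar"

# Print the command instead of running it.
echo "$CMD"
```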

Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
onProtocol object. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Jun 7, 2016 at 8:18 PM, Jacek Laskowski wrote: > Hi, > > It is the driver - see the

Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
Hi, I'm not surprised to see Hadoop jars on the driver (yet I couldn't explain exactly why they need to be there). I can't find a way now to display the classpath for executors. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://b

Re: setting column names on dataset

2016-06-07 Thread Jacek Laskowski
et[Person] = [name: string, age: int]
scala> ds.as("a").joinWith(ds.as("b"), $"a.name" === $"b.name").show(false)
+--------+--------+
|_1      |_2      |
+--------+--------+
|[foo,42]|[foo,42]|
|[bar,24]|[bar,24]|
+--------+--------+
Pozdrawiam, Jacek Lask
