Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Patrick Woody
You can use bucketBy to avoid shuffling in your scenario. This test suite has some examples: https://github.com/apache/spark/blob/45cf5e99503b00a6bd83ea94d6d92761db1a00ab/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L343 Thanks, Terry
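
A minimal sketch of the bucketBy approach (the dfA/dfB DataFrames, table names, bucket count, and key columns x/y are illustrative; both sides need the same bucketing on the join keys):

    # Persist both tables bucketed and sorted on the join keys so the
    # subsequent join can avoid a shuffle (matching bucket counts assumed).
    dfA.write.bucketBy(16, "x", "y").sortBy("x", "y").saveAsTable("bucketed_a")
    dfB.write.bucketBy(16, "x", "y").sortBy("x", "y").saveAsTable("bucketed_b")

    joined = spark.table("bucketed_a").join(spark.table("bucketed_b"), ["x", "y"])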

Using existing distribution for join when subset of keys

2020-05-31 Thread Patrick Woody
Hey all, I have one large table, A, and two medium sized tables, B & C, that I'm trying to complete a join on efficiently. The result is multiplicative on A join B, so I'd like to avoid shuffling that result. For this example, let's just assume each table has three columns, x, y, z. The below is

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
…columns with list comprehensions forming a single select() statement makes for a smaller DAG. On Mon, Aug 3, 2020 at 10:06 AM Henrique Oliveira wrote: > Hi Patrick, thank you for your quick response. That's exactly what I think. Actually, the result of this processing is an int…
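
A sketch of that idea; since a select() allows only one generator, one common way to explode several array columns in lock-step is arrays_zip (Spark 2.4+) combined with a list comprehension in a single select (column names are hypothetical):

    from pyspark.sql import functions as F

    array_cols = ["col_a", "col_b"]  # hypothetical array-typed columns

    # One select() with a list comprehension instead of chained
    # withColumn()/explode() calls keeps the query plan small.
    df2 = (
        df.select("id", F.explode(F.arrays_zip(*array_cols)).alias("z"))
          .select("id", *[F.col("z." + c).alias(c) for c in array_cols])
    )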

Re: regexp_extract regex for extracting the columns from string

2020-08-10 Thread Patrick McCarthy
> apart from udf, is there any way to achieve it? Thanks
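
For reference, a regexp_extract sketch that avoids a UDF (the pattern and column names are invented for illustration):

    from pyspark.sql import functions as F

    # The third argument selects which capture group to return.
    df2 = df.select(
        F.regexp_extract("raw", r"(\w+)=(\d+)", 1).alias("key"),
        F.regexp_extract("raw", r"(\w+)=(\d+)", 2).alias("value"),
    )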

Building Spark 3.0.0 for Hive 1.2

2020-07-10 Thread Patrick McCarthy
…ConfString(key, value)
  File "/home/pmccarthy/custom-spark-3/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/pmccarthy/custom-spark-3/python/pyspark/sql/utils.py", line 137, in deco
    raise_from(converted)
  File "", line 3, in …

Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Patrick McCarthy

Re: Reading TB of JSON file

2020-06-18 Thread Patrick McCarthy
…afford having 50 GB of driver memory. In general, what is the best practice to read a large JSON file like 50 GB? Thanks

Re: Add python library

2020-06-08 Thread Patrick McCarthy
…below is an example:

def do_something(p):
    ...

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, …

Re: Issue while installing dependencies Python Spark

2020-12-17 Thread Patrick McCarthy
path/to/venv/bin/python3. This did not help either. Kind Regards, Sachit Murarka

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
…running code on a local machine, which is a single-node machine. Looking at the logs, it appears the host was killed. This is happening very frequently and I am unable to find the reason for it. Could low memory be the reason?

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
…program starts running fine. This error goes away on… On Thu, 17 Dec 2020, 23:50 Patrick McCarthy wrote: > my-domain.com/192.168.166.8:63534 probably isn't a valid address on your network, is it?

Re: Issue while installing dependencies Python Spark

2020-12-18 Thread Patrick McCarthy
…that risk? In either case you move about the same number of bytes around. On Fri, Dec 18, 2020 at 3:04 PM Sachit Murarka wrote: > Hi Patrick/Users, I am exploring wheel files for packages for this, as this seems simple: https://bytes.grubhub.com/managing-dependen…
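
A sketch of the wheel route (the file name is hypothetical; the wheel must be pure Python or built for the executors' platform, which is the risk traded off above):

    # Distribute a wheel to every executor and put it on the Python path.
    # addPyFile is documented for .py/.zip/.egg; pure-Python wheels are
    # zip archives, so this commonly works, but treat it as an assumption.
    spark.sparkContext.addPyFile("deps/mypackage-1.0-py3-none-any.whl")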

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Patrick McCarthy
…are there other Spark patterns that I should attempt in order to achieve my end goal of a vector of attributes for every entity? Thanks, Daniel

Profiling options for PandasUDF (2.4.7 on yarn)

2021-05-28 Thread Patrick McCarthy
…of (count, row_id, column_id). It works at small scale but gets unstable as I scale up. Is there a way to profile this function in a Spark session, or am I limited to profiling on pandas DataFrames without Spark?
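
One option is to run cProfile inside the UDF itself and print the stats so they land in the executor logs; a minimal sketch using the Spark 2.4 pandas_udf API (the UDF body is a stand-in):

    import cProfile, io, pstats
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("double", PandasUDFType.SCALAR)
    def scored(v):
        prof = cProfile.Profile()
        prof.enable()
        result = v * 2.0  # stand-in for the real per-batch work
        prof.disable()
        buf = io.StringIO()
        pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
        print(buf.getvalue())  # appears in the executor's stderr log
        return result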

Re: Spark stand-alone mode

2023-09-15 Thread Patrick Tucci
I use Spark in standalone mode. It works well, and the instructions on the site are accurate for the most part. The only thing that didn't work for me was the start-all.sh script. Instead, I use a simple script that starts the master node, then uses SSH to connect to the worker machines and start the workers.

Re: Spark stand-alone mode

2023-09-19 Thread Patrick Tucci
Multiple applications can run at once, but you need to either configure Spark or your applications to allow that. In stand-alone mode, each application attempts to take all resources available by default. This section of the documentation has more details:
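
For example, capping each application so several can share a stand-alone cluster (values are illustrative):

    from pyspark.sql import SparkSession

    # Without spark.cores.max, a stand-alone app grabs every available core.
    spark = (
        SparkSession.builder
        .appName("shared-app")
        .config("spark.cores.max", "4")        # cap total cores for this app
        .config("spark.executor.memory", "4g")
        .getOrCreate()
    )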

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Patrick Tucci
…to select I.*. This will show you the records from item that the join produces. If the first part of the code only returns one record, I expect you will see 4 distinct records returned here. Thanks, Patrick
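
The diagnostic, sketched in Spark SQL (table and key names other than "item" are assumptions):

    # Selecting I.* instead of an aggregate exposes which item rows the
    # join matches; more rows than expected means the join key repeats.
    spark.sql("""
        SELECT I.*
        FROM orders O
        JOIN item I
          ON O.item_id = I.item_id
    """).show()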

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
…that the driver didn't have enough memory to broadcast objects. After increasing the driver memory, the query runs without issue. I hope this can be helpful to someone else in the future. Thanks again for the support, Patrick
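
If the driver can't be given more memory, the broadcast behavior can also be tuned at runtime; a sketch:

    # Lower (or disable with -1) the size threshold below which Spark
    # broadcasts a join side through the driver.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")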

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
…to this thread if the issue comes up again (hopefully it doesn't!). Thanks again, Patrick On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh wrote: > Hi Patrick, glad that you have managed to sort this problem out. Hopefully it will go away for good. Still we are in the dark about…

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
…acquires all available cluster resources when it starts. This is okay; as of right now, I am the only user of the cluster. If I add more users, they will also be SQL users, submitting queries through the Thrift server. Let me know if you have any other questions or thoughts. Thanks, Patrick

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
On Thu, 17 Aug 2023 at 21:01, Patrick Tucci wrote:
> Hi Mich,
> Here are my config values from spark-defaults.conf:
> spark.eventLog.enabled true
> spark.eventLog.dir hdfs://10.0…

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Patrick Tucci
Window functions don't work like traditional GROUP BYs. They allow you to partition data and pull any relevant column, whether it's used in the partition or not. I'm not sure what the syntax is for PySpark, but the standard SQL would be something like this: WITH InputData AS ( SELECT 'USA'
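
A completed sketch of that pattern (the CTE rows beyond 'USA' and all column names are invented for illustration):

    # ROW_NUMBER() ranks rows within each partition; filtering rn = 1
    # keeps the best row per group while any column stays selectable.
    spark.sql("""
        WITH InputData AS (
            SELECT 'USA' AS country, 'NYC' AS city, 8.4 AS population
            UNION ALL SELECT 'USA', 'LA', 3.9
            UNION ALL SELECT 'FRA', 'Paris', 2.1
        )
        SELECT country, city, population
        FROM (
            SELECT *, ROW_NUMBER() OVER (
                PARTITION BY country ORDER BY population DESC) AS rn
            FROM InputData
        ) t
        WHERE rn = 1
    """).show()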

RE: Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-22 Thread Patrick Tucci
Thanks. How would I go about formally submitting a feature request for this? On 2022/11/21 23:47:16 Andrew Melo wrote: > I think this is the right place, just a hard question :) As far as I know, there's no "case insensitive flag", so YMMV

RE: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Patrick Tucci
Is this the wrong list for this type of question? On 2022/11/12 16:34:48 Patrick Tucci wrote: > Hello, is there a way to set string comparisons to be case-insensitive globally? I understand LOWER() can be used, but my codebase contains 27k lines of SQL and many string…
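
For reference, the LOWER() workaround being discussed, which must be applied at every comparison site (table and column names are hypothetical):

    # Normalize both sides of each comparison; the feature request is
    # for a global flag so this isn't needed across 27k lines of SQL.
    spark.sql("SELECT * FROM events WHERE LOWER(status) = LOWER('Active')")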

[Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-12 Thread Patrick Tucci
…row(s). Desired behavior would be true for all of the above with the proposed case-insensitive flag set. Thanks, Patrick

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
Thanks, Patrick. On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria wrote: > Hi Patrick, you can have multiple writers simultaneously writing to the same table in HDFS by utilizing an open table format with concurrency control. Several formats, such as Apache Hudi, Apache Iceberg…
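
A sketch of what that looks like with Delta Lake (the path is hypothetical; Delta's optimistic concurrency control arbitrates simultaneous appends):

    # Each writer appends through the Delta transaction log rather than
    # writing files into the table directory directly.
    df.write.format("delta").mode("append").save("/user/spark/warehouse/eventclaims_delta")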

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
/user/spark/warehouse/eventclaims. Is it possible to have multiple concurrent writers to the same table with Spark SQL? Is there any way to make this work? Thanks for the help. Patrick

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
…of the reason why I chose it. Thanks again for the reply, I truly appreciate your help. Patrick On Thu, Aug 10, 2023 at 3:43 PM Mich Talebzadeh wrote: > sorry host is 10.0.50.1

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
…hadoop -f command.sql Thanks again for your help. Patrick On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh wrote: > Can you run this SQL query through Hive itself? Are you using this command or similar for your Thrift server? beeline -u jdbc:hive2:/…

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
, but no stages or tasks are executing or pending: [image: image.png] I've let the query run for as long as 30 minutes with no additional stages, progress, or errors. I'm not sure where to start troubleshooting. Thanks for your help, Patrick

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
…-to-delta-using-jdbc Thanks again to everyone who replied for their help. Patrick On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh wrote: > Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly. Since this limitation may be due to Hive…

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
…to Delta Lake and see if that solves the issue. Thanks again for your feedback. Patrick On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh wrote: > Hi Patrick, there is not anything wrong with Hive on-premise; it is the best data warehouse there is. Hive handles both ORC and P…

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
…take more than 24x longer than a simple SELECT COUNT(*) statement. Thanks for any help. Please let me know if I can provide any additional information. Patrick [attachment: Create Table.sql]

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
…The same CTAS query only took about 45 minutes. This is still a bit slower than I had hoped, but the import from bzip fully utilized all available cores, so we can give the cluster more resources if we need the process to go faster. Patrick

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci

unsubscribe

2023-11-09 Thread Duflot Patrick
unsubscribe
