You can use bucketBy to avoid shuffling in your scenario. This test suite
has some examples:
https://github.com/apache/spark/blob/45cf5e99503b00a6bd83ea94d6d92761db1a00ab/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L343
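For example, something along these lines in PySpark (an untested sketch; the
table names and the bucket count of 64 are placeholders, not from your
scenario):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Write each table bucketed (and sorted) on the join key x. Using the same
# bucket count on every side lets Spark plan the joins without a shuffle.
for name in ["a", "b", "c"]:
    (spark.table(f"raw_{name}")          # placeholder source table names
        .write
        .bucketBy(64, "x")               # 64 buckets is an arbitrary choice
        .sortBy("x")
        .mode("overwrite")
        .saveAsTable(f"bucketed_{name}"))

result = (spark.table("bucketed_a")
          .join(spark.table("bucketed_b"), "x")
          .join(spark.table("bucketed_c"), "x"))
result.explain()  # check the plan for the absence of Exchange nodes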

Thanks,
Terry

On S
Hey all,
I have one large table, A, and two medium sized tables, B & C, that I'm
trying to complete a join on efficiently. The result is multiplicative on A
join B, so I'd like to avoid shuffling that result. For this example, let's
just assume each table has three columns, x, y, z. The below is
columns with list comprehensions forming a single select() statement
makes for a smaller DAG.
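For example, something like this (untested sketch; the DataFrame and column
names are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(" a ", " b ", 1)], ["col_a", "col_b", "id"])

cols_to_clean = ["col_a", "col_b"]

# One select() built with list comprehensions, instead of a long chain of
# withColumn() calls, keeps the logical plan (and therefore the DAG) smaller.
cleaned = df.select(
    *[F.trim(F.col(c)).alias(c) for c in cols_to_clean],
    *[c for c in df.columns if c not in cols_to_clean],
)
cleaned.show()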
On Mon, Aug 3, 2020 at 10:06 AM Henrique Oliveira wrote:
> Hi Patrick, thank you for your quick response.
> That's exactly what I think. Actually, the result of this processing is an
> int
> Apart from a UDF, is there any way to achieve it?
>
>
> Thanks
>
>
>
ConfString(key, value)
  File "/home/pmccarthy/custom-spark-3/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/pmccarthy/custom-spark-3/python/pyspark/sql/utils.py", line 137, in deco
    raise_from(converted)
  File "", line 3, in
fford having 50 GB of driver memory. In general, what
> is the best practice to read a large JSON file, like 50 GB?
>
> Thanks
>
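One approach, assuming the file is line-delimited JSON on a distributed
filesystem (untested sketch; the path and schema are placeholders):
supplying an explicit schema avoids a full inference pass, and line-delimited
input lets Spark split the read across executors instead of pulling 50 GB
onto the driver.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# multiLine is False by default, so each JSON record sits on its own line
# and the file can be split across tasks.
df = spark.read.schema(schema).json("hdfs:///data/big_file.jsonl")
df.count()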
low is an
>>> example:
>>>
>>> def do_something(p):
>>>     ...
>>>
>>> rdd = sc.parallelize([
>>>     {"x": 1, "y": 2},
>>>     {"x": 2, "y": 3},
>>>     {"x": 3,
path/to/venv/bin/python3
>
> This did not help either.
>
> Kind Regards,
> Sachit Murarka
>
ing code on a local machine that is a single-node machine.
>
> Looking at the logs, it appears the host was killed. This is happening
> very frequently and I am unable to find the reason for it.
>
> Could low memory be the reason?
>
> On Fri, 18 Dec 2020, 00:11 Patrick McCar
gram starts running fine.
> This error goes away on
>
> On Thu, 17 Dec 2020, 23:50 Patrick McCarthy,
> wrote:
>
>> my-domain.com/192.168.166.8:63534 probably isn't a valid address on your
>> network, is it?
>>
>> On Thu, Dec 17, 2020 at 3:03 AM Vikas Garg wr
that risk? In either case you move about the same
number of bytes around.
On Fri, Dec 18, 2020 at 3:04 PM Sachit Murarka
wrote:
> Hi Patrick/Users,
>
> I am exploring wheel files for packages for this, as this seems simple:
>
>
> https://bytes.grubhub.com/managing-dependen
there other Spark patterns that I should attempt in order to achieve
> my end goal of a vector of attributes for every entity?
>
> Thanks, Daniel
>
of (count,
row_id, column_id).
It works at small scale but gets unstable as I scale up. Is there a way to
profile this function in a spark session or am I limited to profiling on
pandas data frames without spark?
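One option is PySpark's built-in Python profiler, which collects cProfile
output per RDD on the driver (a sketch; whether it captures your specific
function depends on how the function is invoked):

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)

def do_work(x):
    # hypothetical stand-in for the function being profiled
    return (x % 7, 1)

sc.parallelize(range(1_000_000)).map(do_work).countByKey()

# Prints the aggregated cProfile stats on the driver.
sc.show_profiles()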
I use Spark in standalone mode. It works well, and the instructions on the
site are accurate for the most part. The only thing that didn't work for me
was the start-all.sh script. Instead, I use a simple script that starts the
master node, then uses SSH to connect to the worker machines and start
Multiple applications can run at once, but you need to either configure
Spark or your applications to allow that. In stand-alone mode, each
application attempts to take all resources available by default. This
section of the documentation has more details:
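As a rough sketch of the kind of settings involved (the master URL and
resource numbers are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")   # placeholder master URL
         .appName("bounded-app")
         # Cap the total cores this application may claim so that other
         # applications can be scheduled alongside it on the cluster.
         .config("spark.cores.max", "8")
         .config("spark.executor.memory", "4g")
         .getOrCreate())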
.* to select I.*. This will
show you the records from item that the join produces. If the first part of
the code only returns one record, I expect you will see 4 distinct records
returned here.
Thanks,
Patrick
On Sun, Oct 22, 2023 at 1:29 AM Meena Rajani wrote:
> Hello all:
>
> I am using
that the
driver didn't have enough memory to broadcast objects. After increasing the
driver memory, the query runs without issue.
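For reference, the two knobs involved look roughly like this (values are
placeholders; note that spark.driver.memory only takes effect if it is set
before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.memory", "8g")
         # Alternatively, turn off automatic broadcast joins entirely:
         .config("spark.sql.autoBroadcastJoinThreshold", "-1")
         .getOrCreate())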
I hope this can be helpful to someone else in the future. Thanks again for
the support,
Patrick
On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh
wrote:
> OK I use H
to this thread if the
issue comes up again (hopefully it doesn't!).
Thanks again,
Patrick
On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh
wrote:
> Hi Patrick,
>
> glad that you have managed to sort this problem out. Hopefully it will go
> away for good.
>
> Still we are in the dark abou
acquires all available
cluster resources when it starts. This is okay; as of right now, I am the
only user of the cluster. If I add more users, they will also be SQL users,
submitting queries through the Thrift server.
Let me know if you have any other questions or thoughts.
Thanks,
Patrick
On Thu
> On Thu, 17 Aug 2023 at 21:01, Patrick Tucci
> wrote:
>
>> Hi Mich,
>>
>> Here are my config values from spark-defaults.conf:
>>
>> spark.eventLog.enabled true
>> spark.eventLog.dir hdfs://10.0
Window functions don't work like traditional GROUP BYs. They allow you to
partition data and pull any relevant column, whether it's used in the
partition or not.
I'm not sure what the syntax is for PySpark, but the standard SQL would be
something like this:
WITH InputData AS
(
SELECT 'USA'
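For the PySpark side, the equivalent pattern would be roughly this (untested
sketch; the data and column names are made up):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("USA", "NY", 10), ("USA", "CA", 30), ("UK", "LDN", 20)],
    ["country", "region", "sales"],
)

# Partition by country but keep every other column in the output; unlike a
# GROUP BY, nothing has to be aggregated away or dropped.
w = Window.partitionBy("country")
df.withColumn("country_total", F.sum("sales").over(w)).show()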
Thanks. How would I go about formally submitting a feature request for this?
On 2022/11/21 23:47:16 Andrew Melo wrote:
> I think this is the right place, just a hard question :) As far as I
> know, there's no "case insensitive flag", so YMMV
>
> On Mon, Nov 21, 2022 at
Is this the wrong list for this type of question?
On 2022/11/12 16:34:48 Patrick Tucci wrote:
> Hello,
>
> Is there a way to set string comparisons to be case-insensitive
globally? I
> understand LOWER() can be used, but my codebase contains 27k lines of SQL
> and many string
row(s)
Desired behavior would be true for all of the above with the proposed
case-insensitive flag set.
Thanks,
Patrick
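For the record, the LOWER()-style workaround looks like this in PySpark (a
sketch with made-up data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alpha",), ("ALPHA",), ("beta",)], ["name"])

# Fold both sides of the comparison to one case.
df.filter(F.lower(F.col("name")) == "alpha").show()  # matches Alpha and ALPHA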
,
Patrick
On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria wrote:
> Hi Patrick,
>
> You can have multiple writers simultaneously writing to the same table in
> HDFS by utilizing an open table format with concurrency control. Several
> formats, such as Apache Hudi, Apache Iceb
/user/spark/warehouse/eventclaims.
Is it possible to have multiple concurrent writers to the same table with
Spark SQL? Is there any way to make this work?
Thanks for the help.
Patrick
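Following Pol's suggestion of an open table format, a minimal sketch using
Delta Lake as one example (it assumes the delta-spark package is on the
classpath; the output path is a placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.createDataFrame([(1, "claim-a")], ["id", "event"])

# Each writer appends independently; Delta's optimistic concurrency control
# reconciles the commits through the table's transaction log.
(df.write
   .format("delta")
   .mode("append")
   .save("/user/spark/warehouse/eventclaims_delta"))  # placeholder path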
of the reason why I chose it.
Thanks again for the reply, I truly appreciate your help.
Patrick
On Thu, Aug 10, 2023 at 3:43 PM Mich Talebzadeh
wrote:
> sorry host is 10.0.50.1
hadoop -f command.sql
Thanks again for your help.
Patrick
On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh
wrote:
> Can you run this sql query through hive itself?
>
> Are you using this command or similar for your thrift server?
>
> beeline -u jdbc:hive2:/
, but no stages or tasks are
executing or pending:
[image: image.png]
I've let the query run for as long as 30 minutes with no additional stages,
progress, or errors. I'm not sure where to start troubleshooting.
Thanks for your help,
Patrick
-to-delta-using-jdbc
Thanks again to everyone who replied for their help.
Patrick
On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh
wrote:
> Steve may have a valid point. You raised an issue with concurrent writes
> before, if I recall correctly. Since this limitation may be due to Hive
>
to Delta Lake and
see if that solves the issue.
Thanks again for your feedback.
Patrick
On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh
wrote:
> Hi Patrick,
>
> There is nothing wrong with Hive on-premise; it is the best data
> warehouse there is.
>
> Hive handles both ORC and P
d take more than 24x longer than a simple
SELECT COUNT(*) statement.
Thanks for any help. Please let me know if I can provide any additional
information.
Patrick
.
The same CTAS query only took about 45 minutes. This is still a bit slower
than I had hoped, but the import from bzip fully utilized all available
cores. So we can give the cluster more resources if we need the process to
go faster.
Patrick
On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh
wrote