Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Maxim Gekk
Thank you, Chao!

On Wed, Nov 30, 2022 at 12:42 PM Jungtaek Lim 
wrote:

> Thanks Chao for driving the release!
>
> On Wed, Nov 30, 2022 at 6:03 PM Wenchen Fan  wrote:
>
>> Thanks, Chao!
>>
>> On Wed, Nov 30, 2022 at 1:33 AM Chao Sun  wrote:
>>
>>> We are happy to announce the availability of Apache Spark 3.2.3!
>>>
>>> Spark 3.2.3 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.2 maintenance branch of Spark. We
>>> strongly recommend all 3.2 users upgrade to this stable release.
>>>
>>> To download Spark 3.2.3, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-2-3.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Maxim Gekk
Congratulations to everyone on the new release, and thanks to Yuming for his
efforts.

Maxim Gekk

Software Engineer

Databricks, Inc.


On Wed, Oct 26, 2022 at 10:14 AM Hyukjin Kwon  wrote:

> Thanks, Yuming.
>
> On Wed, 26 Oct 2022 at 16:01, L. C. Hsieh  wrote:
>
>> Thank you for driving the release of Apache Spark 3.3.1, Yuming!
>>
>> On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun 
>> wrote:
>> >
>> > It's great. Thank you so much, Yuming!
>> >
>> > Dongjoon
>> >
>> > On Tue, Oct 25, 2022 at 11:23 PM Yuming Wang  wrote:
>> >>
>> >> We are happy to announce the availability of Apache Spark 3.3.1!
>> >>
>> >> Spark 3.3.1 is a maintenance release containing stability fixes. This
>> >> release is based on the branch-3.3 maintenance branch of Spark. We
>> >> strongly recommend all 3.3 users upgrade to this stable release.
>> >>
>> >> To download Spark 3.3.1, head over to the download page:
>> >> https://spark.apache.org/downloads.html
>> >>
>> >> To view the release notes:
>> >> https://spark.apache.org/releases/spark-release-3-3-1.html
>> >>
>> >> We would like to acknowledge all community members for contributing to
>> >> this release. This release would not have been possible without you.
>> >>
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Apache Spark

2021-01-26 Thread Maxim Gekk
Hi Andrey,

You can write to https://databricks.com/company/contact . We can probably
offer something to you. For instance, Databricks has an OEM program which
might be of interest to you:
https://partners.databricks.com/prm/English/c/Overview

Maxim Gekk

Software Engineer

Databricks, Inc.


On Tue, Jan 26, 2021 at 8:22 PM Lalwani, Jayesh 
wrote:

> All of the major cloud vendors have some sort of Spark offering. They
> provide support if you build in their cloud.
>
>
>
> *From: *Andrey Siniy 
> *Date: *Tuesday, January 26, 2021 at 7:52 AM
> *To: *"user@spark.apache.org" 
> *Subject: *[EXTERNAL] Apache Spark
>
>
>
> Hello!
>
>
>
> We plan to use Apache Spark in our organization. Can I purchase
> paid technical support for this software?
>
>
>
>
>
>
>
> Best regards,
>
> Andrey Siniy
>
> Head of Division
>
> Software Management Center
>
> Branch of PJSC MTS in the Nizhny Novgorod Region
>
> Mobile TeleSystems Public Joint-Stock Company
>
> __
>
> IP: 90096
>
> mob: +79103801534
>
> e-mail: avs...@mts.ru
>
> Nizhny Novgorod, Gagarina Ave. 168A, office P8, 3, 310
>
>
>
>
>


Re: Spark 3.0.1 new Proleptic Gregorian calendar

2020-11-19 Thread Maxim Gekk
Hello Saurabh,

> What config options should we set,
> - if we are always going to read old data written from Spark 2.4 using
> Spark 3.0?

You should set *spark.sql.legacy.parquet.datetimeRebaseModeInRead* to
*LEGACY* when you read old data.

You see this exception because Spark 3.0 cannot determine which Spark version
wrote the Parquet files and which calendar was used while saving them. Starting
from version 2.4.6, Spark saves this metadata to Parquet files, and Spark 3.0
can infer the mode automatically.
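
For example, a minimal sketch for spark-shell (the input/output paths below are
hypothetical):

// LEGACY rebases dates/timestamps from the hybrid Julian + Gregorian calendar
// used by Spark <= 2.4 to the Proleptic Gregorian calendar while reading.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
val oldData = spark.read.parquet("/data/written_by_spark24")

// If the new data will only ever be read back with Spark 3.0+, the write side
// can use CORRECTED mode (write as-is in the Proleptic Gregorian calendar).
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
oldData.write.parquet("/data/rewritten_by_spark30")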

Maxim Gekk

Software Engineer

Databricks, Inc.


On Thu, Nov 19, 2020 at 8:10 PM Saurabh Gulati
 wrote:

> Hello,
> First of all, Thanks to you guys for maintaining and improving Spark.
>
> We just updated to Spark 3.0.1 and are facing some issues with the new
> Proleptic Gregorian calendar.
>
> We have data from different sources in our platform, and we saw there were
> some *date/timestamp* columns that go back to years before 1500.
>
> According to this
> <https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read>
> post, data written with Spark 2.4 and read with 3.0 should result in some
> difference in *dates/timestamps*, but we are not able to replicate this
> issue. We only encounter an exception that suggests that we set
> *spark.sql.legacy.parquet.datetimeRebaseModeInRead/Write* config options
> to make it work.
>
> So, our main concern is:
>
>    - How can we test/replicate this behavior? Since it's not very clear
>    to us, nor do we see any docs for this change, we can't decide with
>    certainty which parameters to set and why.
>    - What config options should we set,
>       - if we are always going to read old data written from Spark 2.4
>       using Spark 3.0, and
>       - will always be writing newer data with Spark 3.0?
>
> We couldn't make an informed choice ourselves, so we thought it better to
> ask the community which scenarios will be impacted and what will still work
> fine.
>
> Thanks
> Saurabh
>
>
>


Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Maxim Gekk
Hello Sanjeev,

It is hard to troubleshoot the issue without the input files. Could you open a
JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the
JSON files there (or samples, or code which generates the JSON files)?

Maxim Gekk

Software Engineer

Databricks, Inc.


On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra 
wrote:

> It has read everything. As you can see, the count timing is still smaller
> in Spark 2.4.
>
> Spark 2.4
>
> scala> spark.time(spark.read.json("/data/20200528"))
> Time taken: 19691 ms
> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5
> more fields]
>
> scala> spark.time(res61.count())
> Time taken: 7113 ms
> res64: Long = 2605349
>
> Spark 3.0
> scala> spark.time(spark.read.json("/data/20200528"))
> 20/06/29 08:06:53 WARN package: Truncated the string representation of a
> plan since it was too large. This behavior can be adjusted by setting
> 'spark.sql.debug.maxToStringFields'.
> Time taken: 849652 ms
> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5
> more fields]
>
> scala> spark.time(res0.count())
> Time taken: 8201 ms
> res2: Long = 2605349
>
>
>
>
> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev  wrote:
>
>> Could you share your code? Are you sure your Spark 2.4 cluster had indeed
>> read anything? It looks like the Input Size field is empty under 2.4.
>>
>> -- ND
>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>
>>
>> I have a large number of JSON files that Spark 2.4 can read in 36 seconds,
>> but Spark 3.0 takes almost 33 minutes to read the same data. On closer
>> analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4.
>> Does anyone have any idea what is going on? Is there a configuration problem
>> with Spark 3.0?
>>
>> Here are the details:
>>
>> *Spark 2.4*
>>
>> Summary Metrics for 2203 Completed Tasks
>>
>> Metric   | Min    | 25th percentile | Median | 75th percentile | Max
>> Duration | 0.0 ms | 0.0 ms          | 0.0 ms | 1.0 ms          | 62.0 ms
>> GC Time  | 0.0 ms | 0.0 ms          | 0.0 ms | 0.0 ms          | 11.0 ms
>>
>> Aggregated Metrics by Executor
>>
>> Executor ID | Address        | Task Time | Total Tasks | Failed Tasks | Killed Tasks | Succeeded Tasks | Blacklisted
>> driver      | 10.0.0.8:49159 | 36 s      | 2203        | 0            | 0            | 2203            | false
>>
>>
>> *Spark 3.0*
>>
>> Summary Metrics for 8 Completed Tasks
>>
>> Metric               | Min              | 25th percentile  | Median           | 75th percentile  | Max
>> Duration             | 3.8 min          | 4.0 min          | 4.1 min          | 4.4 min          | 5.0 min
>> GC Time              | 3 s              | 3 s              | 3 s              | 4 s              | 4 s
>> Input Size / Records | 15.6 MiB / 51028 | 16.2 MiB / 53303 | 16.8 MiB / 55259 | 17.8 MiB / 58148 | 20.2 MiB / 71624
>>
>> Aggregated Metrics by Executor
>>
>> Executor ID | Address        | Task Time | Total Tasks | Failed Tasks | Killed Tasks | Succeeded Tasks | Blacklisted | Input Size / Records
>> driver      | 10.0.0.8:50224 | 33 min    | 8           | 0            | 0            | 8               | false       | 136.1 MiB / 451999
>>
>>
>> The DAG is also different.
>> Spark 2.4 DAG
>>
>> [image: Screenshot 2020-06-27 16.30.26.png]
>>
>> Spark 3.0 DAG
>>
>> [image: Screenshot 2020-06-27 16.32.32.png]
>>
>>
>>


Re: Better way to debug serializable issues

2020-02-18 Thread Maxim Gekk
Hi Ruijing,

Spark uses SerializationDebugger (
https://spark.apache.org/docs/latest/api/java/org/apache/spark/serializer/SerializationDebugger.html)
as the default debugger to detect serialization issues. You can get more
detailed serialization exception information by setting the following options
when creating a cluster:

spark.driver.extraJavaOptions -Dsun.io.serialization.extendedDebugInfo=true
spark.executor.extraJavaOptions -Dsun.io.serialization.extendedDebugInfo=true
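
If you submit jobs with spark-submit, the same options can also be passed on
the command line; a minimal sketch (the class and jar names are hypothetical):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true" \
  --class com.example.MyJob \
  my-job-assembly.jar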

Maxim Gekk

Software Engineer

Databricks, Inc.


On Tue, Feb 18, 2020 at 1:02 PM Ruijing Li  wrote:

> Hi all,
>
> When working with Spark jobs, I sometimes have to tackle serialization
> issues, and I have a difficult time trying to fix those. A lot of times,
> the serialization issues happen only in cluster mode across the network in
> a Mesos container, so I can’t debug locally. And the exception thrown by
> Spark is not very helpful for finding the cause.
>
> I’d love to hear some tips on how to debug in the right places. Also, I’d
> be interested to know if in future releases it would be possible to point
> out which class or function is causing the serialization issue (right now I
> find it's either Java generic classes or the class Spark is running itself).
> Thanks!
> --
> Cheers,
> Ruijing Li
>


Re: How to access line fileName in loading file using the textFile method

2018-09-24 Thread Maxim Gekk
> So my question is: supposing all files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each piece of data
> came from?

Maybe the input_file_name() function can help you:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@input_file_name():org.apache.spark.sql.Column
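
For example, a minimal sketch for spark-shell using the DataFrame API (the path
"path/*" is taken from your question and is hypothetical):

import org.apache.spark.sql.functions.{explode, input_file_name, split}
import spark.implicits._

// Read all files, keep the source file name of every line, and tokenize.
val lines = spark.read.text("path/*")
  .select(input_file_name().as("fileName"), $"value")

// Pairs of (word, fileName)
val pairs = lines.select(explode(split($"value", "\\s+")).as("word"), $"fileName")
pairs.show(truncate = false)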

On Mon, Sep 24, 2018 at 2:54 PM Soheil Pourbafrani 
wrote:

> Hi, my text data are in the form of text files. In the processing logic, I
> need to know which file each word came from. Actually, I need to tokenize
> the words and create pairs of (word, fileName). The naive solution is to
> call sc.textFile for each file and, having the fileName in a variable,
> create the pairs, but it's not efficient and I got a StackOverflowError
> as the dataset grew.
>
> So my question is: supposing all files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each piece of data
> came from?
>
> Is it possible (and needed) to customize the textFile method?
>


-- 

Maxim Gekk

Technical Solutions Lead

Databricks Inc.

maxim.g...@databricks.com

databricks.com

  <http://databricks.com/>


Re: How to read multiple libsvm files in Spark?

2018-09-24 Thread Maxim Gekk
Hi,

> Any other alternatives?

Manually form the input path by combining multiple paths via commas. See
https://issues.apache.org/jira/browse/SPARK-12086
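
Another workaround (not the approach above, just a sketch with hypothetical
file names) is to load each libsvm file separately and union the results:

// Read each file on its own and union the resulting DataFrames.
// If the files may have different feature dimensions, consider setting the
// "numFeatures" option so all vectors share the same size.
val files = Seq(
  "url_svmlight/Day0.svm",
  "url_svmlight/Day1.svm"
)
val urls = files
  .map(path => spark.read.format("libsvm").load(path))
  .reduce(_ union _)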

On Thu, Sep 20, 2018 at 12:47 PM Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> I'm getting the exception "Exception in thread "main" java.io.IOException:
> Multiple input paths are not supported for libsvm data" while trying to
> read multiple libsvm files using Spark 2.3.0:
>
> val URLs =
> spark.read.format("libsvm").load("url_svmlight.tar/url_svmlight/*.svm")
>
> Any other alternatives?
>

  


Re: from_json function

2018-08-15 Thread Maxim Gekk
Hello Denis,

The from_json function supports only the FAILFAST mode; see:
https://github.com/apache/spark/blob/e2ab7deae76d3b6f41b9ad4d0ece14ea28db40ce/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L568

Your setting "mode" -> "PERMISSIVE" will be overwritten.
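
Since the mode is forced to FAILFAST internally, `_corrupt_record` will stay
empty. A sketch of a possible workaround (not a from_json feature; it assumes
the Spark 2.x behavior where a malformed record yields a null struct) is to
keep the raw string column and derive the corrupt record from it:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("number", IntegerType)

val parsed = Seq("{'number': 1}", "{'number': }").toDF("column")
  .select($"column", from_json($"column", schema) as "data")
  // A row that failed to parse has a null struct, so keep its raw JSON.
  .withColumn("corrupt_record", when($"data".isNull, $"column"))

parsed.show(false)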

On Wed, Aug 15, 2018 at 4:52 PM dbolshak  wrote:

> Hello community,
>
> I cannot manage to run the from_json method with the "columnNameOfCorruptRecord"
> option.
> ```
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> import spark.implicits._
>
> val data = Seq(
>   "{'number': 1}",
>   "{'number': }"
> )
>
> val schema = new StructType()
>   .add($"number".int)
>   .add($"_corrupt_record".string)
>
> val sourceDf = data.toDF("column")
>
> val jsonedDf = sourceDf
>   .select(from_json(
> $"column",
> schema,
> Map("mode" -> "PERMISSIVE", "columnNameOfCorruptRecord" ->
> "_corrupt_record")
>   ) as "data").selectExpr("data.number", "data._corrupt_record")
>
>   jsonedDf.show()
> ```
> Can anybody help me get `_corrupt_record` to be non-empty?
>
> Thanks in advance.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 

Maxim Gekk

Technical Solutions Lead

Databricks Inc.

maxim.g...@databricks.com

databricks.com

  <http://databricks.com/>