Re: Honor ParseMode in AvroFileFormat

2019-03-07 Thread Gengliang
Hi Tim, I think you can try setting the option *spark.sql.files.ignoreCorruptFiles* as *true*. With the option enabled, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. The CSV/JSON data source supports the
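A minimal sketch of enabling that flag, assuming placeholder paths and that the spark-avro module is on the classpath:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignore-corrupt-files").getOrCreate()

# With the flag on, jobs keep running past corrupted files and return
# whatever contents were successfully read before the corruption.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# placeholder path; format("avro") assumes the spark-avro package is loaded
df = spark.read.format("avro").load("/data/events/*.avro")
print(df.count())
```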

Re: [SQL] hash: 64-bits and seeding

2019-03-07 Thread Huon.Wilson
Thanks for the guidance. That was my initial inclination, but I decided that consistency with the existing ‘hash’ was better. However, like you, I also prefer the specific form. I’ve opened https://issues.apache.org/jira/browse/SPARK-27099 and submitted the patch (using ‘xxhash64’) at
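For context, a small sketch of how the existing 32-bit `hash` would sit alongside the 64-bit function proposed in SPARK-27099; `xxhash64` is the name from the patch, and whether it is exposed as a SQL function is an assumption here:
```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

df.select(
    F.hash("name").alias("hash32"),            # existing Murmur3-based 32-bit hash
    F.expr("xxhash64(name)").alias("hash64"),   # proposed 64-bit xxHash (SPARK-27099)
).show()
```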

[VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-07 Thread DB Tsai
Please vote on releasing the following candidate as Apache Spark version 2.4.1. The vote is open until March 11 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.1 [ ] -1 Do not release this package because ... To

Re: Honor ParseMode in AvroFileFormat

2019-03-07 Thread tim
/facepalm Here we go: https://issues.apache.org/jira/browse/SPARK-27093 Tim

Re: Honor ParseMode in AvroFileFormat

2019-03-07 Thread tim
Thanks Xiao, it's good to have that validated. I've created a ticket here: https://issues.apache.org/jira/browse/AVRO-2342

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
right now, i'm using the columns-at-a-time mapping https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129 On Thu, Mar 7, 2019 at 4:00 PM Sean Owen wrote: > Maybe, it depends on what you're doing. It sounds like you are trying > to do row-at-a-time mapping, even on a pandas

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Maybe, it depends on what you're doing. It sounds like you are trying to do row-at-a-time mapping, even on a pandas DataFrame. Is what you're doing vectorized? If not, it may not help much. Just make the pandas Series into a DataFrame if you want? and a single col back to Series? On Thu, Mar 7, 2019 at 2:45

Honor ParseMode in AvroFileFormat

2019-03-07 Thread tim
Hi Spark Devs, We're processing a large number of Avro files with Spark and found that the Avro reader is missing the ability to handle malformed or truncated files like the JSON reader. Currently the Avro reader throws exceptions when it encounters any bad or truncated record in an Avro file,
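For comparison, a sketch of the behaviour being asked for, using the `mode` option the JSON reader already honours; paths are placeholders and the Avro line assumes the spark-avro module is on the classpath:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JSON today: malformed records can be dropped instead of failing the job
json_df = spark.read.option("mode", "DROPMALFORMED").json("/data/events/*.json")

# Avro today: no equivalent option, so a truncated file raises an exception
avro_df = spark.read.format("avro").load("/data/events/*.avro")
```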

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
it is very similar to SCALAR, but for SCALAR the output can't be struct/row and the input has to be pd.Series, which doesn't support a row. I'm doing tensorflow batch inference in spark, https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108 Which i have to do the groupBy in
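A minimal sketch of that GROUPED_MAP workaround, with a toy DataFrame and a placeholder in place of the real TensorFlow model from the linked code:
```
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType, spark_partition_id

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).toDF("id")

@pandas_udf("id long, score double", PandasUDFType.GROUPED_MAP)
def predict_block(pdf):
    # placeholder for batch inference: score the whole pandas block at once
    return pd.DataFrame({"id": pdf["id"], "score": pdf["id"] * 0.1})

# grouping by partition id hands each partition to the udf as one pandas block
scored = df.groupBy(spark_partition_id()).apply(predict_block)
scored.show()
```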

RE: Hive Hash in Spark

2019-03-07 Thread tcondie
Thanks Ryan and Reynold for the information! Cheers, Tyson From: Ryan Blue Sent: Wednesday, March 6, 2019 3:47 PM To: Reynold Xin Cc: tcon...@gmail.com; Spark Dev List Subject: Re: Hive Hash in Spark I think this was needed to add support for bucketed Hive tables. Like Tyson

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
and in this case, i'm actually benefiting from the columnar arrow support, so that i can pass the whole data block to tensorflow and obtain the block of predictions all at once. On Thu, Mar 7, 2019 at 3:45 PM peng yu wrote: > pandas/arrow is for the memory efficiency, and mapPartitions is only

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
pandas/arrow is for the memory efficiency, and mapPartitions is only available on rdds; for sure i can do everything in rdd. But i thought that's the whole point of having pandas_udf, so my program runs faster and consumes less memory? On Thu, Mar 7, 2019 at 3:40 PM Sean Owen wrote: > Are you
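A rough sketch of that RDD route, building a pandas DataFrame per partition by hand; the column name and the scoring step are placeholders:
```
import pandas as pd
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).toDF("id")

def per_partition(rows):
    # materialize the partition as one pandas block
    pdf = pd.DataFrame([r.asDict() for r in rows])
    if pdf.empty:
        return
    pdf["score"] = pdf["id"] * 0.1            # placeholder for batch inference
    for rec in pdf.itertuples(index=False):
        yield Row(id=int(rec.id), score=float(rec.score))

new_df = df.rdd.mapPartitions(per_partition).toDF()
new_df.show()
```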

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you just applying a function to every row in the DataFrame? you don't need pandas at all. Just get the RDD of Row from it and map a UDF that makes another Row, and go back to DataFrame. Or make a UDF that operates on all columns and returns a new value. mapPartitions is also available if you
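A rough sketch of that row-at-a-time route, mapping over the RDD of Row objects and rebuilding a DataFrame; the columns and the derived value are made up:
```
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(a=1, b=2.0), Row(a=3, b=4.0)])

def transform(row):
    # build a new Row from the old one; no pandas involved
    return Row(a=row.a, b=row.b, c=row.a + row.b)

new_df = df.rdd.map(transform).toDF()
new_df.show()
```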

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you looking for SCALAR? That lets you map one row to one row, but does it more efficiently in batch. What are you trying to do? On Thu, Mar 7, 2019 at 2:03 PM peng yu wrote: > > I'm looking for a mapPartition(pandas_udf) for a pyspark.Dataframe. > > ``` > @pandas_udf(df.schema,
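A small sketch of the SCALAR variant being suggested: logically one row in, one row out, but executed on whole Arrow batches as pandas Series; the column and function are placeholders:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).toDF("x")

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(x):
    # x arrives as a pandas Series covering a whole batch
    return (x + 1).astype("float64")

df.select(plus_one("x")).show()
```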

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
I'm looking for a mapPartition(pandas_udf) for a pyspark.DataFrame.
```
@pandas_udf(df.schema, PandasUDFType.MAP)
def do_nothing(pandas_df):
    return pandas_df

new_df = df.mapPartition(do_nothing)
```
pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just MAP? On Thu, Mar 7,

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you looking for @pandas_udf in Python? Or just mapPartition? Those exist already On Thu, Mar 7, 2019, 1:43 PM peng yu wrote: > There is a nice map_partition function in R `dapply`. so that user can > pass a row to udf. > > I'm wondering why we don't have that in python? > > I'm trying to

[pyspark] dataframe map_partition

2019-03-07 Thread peng yu
There is a nice map_partition function in R, `dapply`, so that a user can pass rows to a udf. I'm wondering why we don't have that in python? I'm trying to have a map_partition function with pandas_udf support. Thanks!

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread shane knapp
https://issues.apache.org/jira/browse/SPARK-26742 On Thu, Mar 7, 2019 at 10:52 AM shane knapp wrote: > i'm ready to update the ubuntu workers/minikube/k8s to support the 4.1.2 > client: > https://issues.apache.org/jira/browse/SPARK-2674 > > i am more than comfortable with this build system

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread shane knapp
i'm ready to update the ubuntu workers/minikube/k8s to support the 4.1.2 client: https://issues.apache.org/jira/browse/SPARK-2674 i am more than comfortable with this build system update, both on the ops and spark project side. we were incredibly far behind the release cycle for k8s and

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread DB Tsai
Saisai, There is no blocker now. I ran into some difficulties in publishing the jars into Nexus. The publish task was finished, but Nexus gave me the following error. *failureMessage Failed to validate the pgp signature of

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread Felix Cheung
There is SPARK-26604 we are looking into From: Saisai Shao Sent: Wednesday, March 6, 2019 6:05 PM To: shane knapp Cc: Stavros Kontopoulos; Sean Owen; DB Tsai; Spark dev list; d_t...@apple.com Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2) Do we have other

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread Sean Owen
Do you know what change fixed it? If it's not a regression from 2.4.0 it wouldn't necessarily go into a maintenance release. If there were no downside, maybe; does it cause any incompatibility with older HBase versions? It may be that this support is targeted for Spark 3 on purpose, which is

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread Jakub Wozniak
Hello, I have a question regarding the 2.4.1 release. It looks like Spark 2.4 (and 2.4.1-rc) is not exactly compatible with Hbase 2.x+ for the Yarn mode. The problem is in the org.apache.spark.deploy.security.HbaseDelegationTokenProvider class that expects a specific version of TokenUtil

Re: Two spark applications listen on same port on same machine

2019-03-07 Thread Moein Hosseini
I'm sure just the first one listens on the port, but in the master UI both applications redirect to the same machine, same port. As I checked the URLs, they redirect to the application UI of the first submitted one. So I think it could be a problem only in the UI. On Wed, Mar 6, 2019, 10:29 PM Sean Owen wrote: > Two