Re: CVEs

2021-07-12 Thread Eric Richardson
Hi Sean and Holden,

I decided it was best to send an email so I could share all my findings
with the team. I think it should be relatively easy to fix with updates but
I am not that good at working on the repo. I tried but ended up with some
roadblocks that were going to take some time to figure out.

Thanks,
Eric

On Mon, Jun 21, 2021 at 5:45 PM Eric Richardson wrote:

> Ok, that sounds like a plan. I will gather what I found and either reach
> out on the security channel and/or try and upgrade with a pull request.
>
> Thanks for pointing me in the right direction.
>
> On Mon, Jun 21, 2021 at 4:52 PM Sean Owen wrote:
>
>> Yeah if it were clearly exploitable right now we'd handle it via private@
>> instead of JIRA; depends on what you think the importance is. If in doubt
>> reply to priv...@spark.apache.org
>>
>> On Mon, Jun 21, 2021 at 6:50 PM Holden Karau wrote:
>>
>>> If you get to a point where you find something you think is highly
>>> likely a valid vulnerability the best path forward is likely reaching out
>>> to private@ to figure out how to do a security release.
>>>
>>> On Mon, Jun 21, 2021 at 4:42 PM Eric Richardson wrote:
>>>
>>>> Thanks for the quick reply. Yes, since it is included in the jars,
>>>> it is unclear, at least to me, whether it is used internally.
>>>>
>>>> I can substitute the jar in the distro to keep the scanner from
>>>> finding it, but then it is unclear whether I could be breaking something or
>>>> not. Given that 3.1.2 is the latest release, I guess you might expect that
>>>> it would pass the scanners, but I am not sure if that version spans 3.0.x
>>>> and 3.1.x or not either.
>>>>
>>>> I can report findings in an issue where I am pretty darn sure it is a
>>>> valid vulnerability if that is ok? That at least would raise the
>>>> visibility.
>>>>
>>>> Will 3.2.x be Scala 2.13.x only or cross compiled with 2.12?
>>>>
>>>> I realize Spark is a beast so I just want to help if I can but also not
>>>> create extra work if it is not useful for me or the Spark team/contributors.
>>>>
>>>> On Mon, Jun 21, 2021 at 3:43 PM Sean Owen wrote:
>>>>
>>>>> Whether it matters really depends on whether the CVE affects Spark.
>>>>> Sometimes it clearly could and so we'd try to back-port dependency updates
>>>>> to active branches.
>>>>> Sometimes it clearly doesn't and hey sometimes the dependency is
>>>>> updated anyway for good measure (mostly to keep this off static analyzer
>>>>> reports) but probably wouldn't backport.
>>>>>
>>>>> Jackson has been a persistent one but in this case Spark is already on
>>>>> 2.12.x in master, and it wasn't clear last time I looked at those CVEs that
>>>>> they can affect Spark itself. End user apps perhaps, but those apps can
>>>>> supply their own Jackson.
>>>>>
>>>>> If someone had a legit view that this is potentially more serious I
>>>>> think we could probably backport that update, but Jackson can be a little
>>>>> bit tricky with compatibility IIRC so would just bear some testing.
>>>>>
>>>>>
>>>>> On Mon, Jun 21, 2021 at 5:27 PM Eric Richardson <ekrichard...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am working with Spark 3.1.2 and getting several vulnerabilities
>>>>>> popping up. I am wondering if the Spark distros are scanned etc. and how
>>>>>> people resolve these.
>>>>>>
>>>>>> For example, I am finding -
>>>>>> https://nvd.nist.gov/vuln/detail/CVE-2020-25649
>>>>>>
>>>>>> This looks like it is fixed in 2.11.0 -
>>>>>> https://github.com/FasterXML/jackson-databind/issues/2589 - but
>>>>>> Spark supplies 2.10.0.
>>>>>>
>>>>>> Thanks,
>>>>>> Eric
>>>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
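
For the scanner finding discussed in this thread (a bundled jackson-databind
2.10.0 versus the 2.11.0 fix cited in the original message), a quick local
check of a Spark distribution's jars directory could look like the sketch
below. The function names, the assumed file naming scheme
(jackson-databind-<version>.jar), and the example path are illustrative
assumptions, not part of Spark. A hit only shows which version is bundled; as
noted in the thread, it does not show that the CVE actually affects Spark.

```python
import re
from pathlib import Path

# Assumption: jars follow the usual "jackson-databind-<numeric version>.jar" naming.
JAR_PATTERN = re.compile(r"^jackson-databind-(\d+(?:\.\d+)*)\.jar$")


def parse_version(text):
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_outdated_jackson(jars_dir, fixed_version="2.11.0"):
    """Return (file name, version) pairs for jackson-databind jars older than fixed_version.

    The default threshold is the version named in the thread; adjust it per CVE.
    """
    fixed = parse_version(fixed_version)
    outdated = []
    for jar in sorted(Path(jars_dir).glob("jackson-databind-*.jar")):
        match = JAR_PATTERN.match(jar.name)
        # Names with non-numeric suffixes (e.g. snapshots) are skipped, not guessed at.
        if match and parse_version(match.group(1)) < fixed:
            outdated.append((jar.name, match.group(1)))
    return outdated
```

Usage would be along the lines of find_outdated_jackson("/opt/spark/jars"),
where the path is a hypothetical install location.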


Performance Improvement with Hive/Thrift Server

2021-07-12 Thread Artemis User
We are trying to switch from Postgres to Spark's built-in Hive with 
Thrift server as the data sink for persisting ML result data, in the 
hope that Hive would improve ML pipeline performance. However, it 
turned out that persisting dataframes to Hive (via Spark SQL's 
saveAsTable API) took significantly longer than persisting them to 
Postgres over JDBC. Has anyone experienced similar problems with Hive? 
Any recommendations for improving performance would be highly appreciated.


We are using Spark in standalone mode. I would assume that running 
Spark against a real Hive database, or simply on Hadoop, would be 
preferable. Has anyone done a performance comparison between running 
Spark with built-in Hive (with just the metastore), Spark on a 
full-fledged Hive DB, and Spark with built-in Hive on Hadoop? Thanks!


-- ND
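
To put numbers on the comparison described above, one option is to time both
writes on the same dataframe. The following is a minimal PySpark-style sketch,
not a benchmark harness: the helper names (timed, compare_sinks) and all
arguments are hypothetical, and it assumes df is an existing DataFrame and
that overwriting the target table in both sinks is acceptable.

```python
import time


def timed(action):
    """Run a zero-argument callable and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def compare_sinks(df, jdbc_url, jdbc_properties, table_name):
    """Write the same dataframe to the Hive-backed catalog and to a JDBC sink.

    Returns a dict of elapsed seconds per sink. Both writes overwrite table_name.
    """
    return {
        # Goes through the session catalog (built-in Hive metastore in this setup).
        "saveAsTable": timed(
            lambda: df.write.mode("overwrite").saveAsTable(table_name)
        ),
        # Goes through the JDBC data source (Postgres in this setup).
        "jdbc": timed(
            lambda: df.write.mode("overwrite").jdbc(
                jdbc_url, table_name, properties=jdbc_properties
            )
        ),
    }
```

Caching and materializing the dataframe first (df.cache() followed by
df.count()) keeps the upstream pipeline cost out of both measurements, so the
numbers mostly reflect the sink itself.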






Why is planInputPartitions called multiple times in a micro-batch?

2021-07-12 Thread kineret M
Hi,

I'm developing a new Spark connector using the DataSource V2 API (Spark 3.1.1).
I noticed that the planInputPartitions method (in MicroBatchStream) is
called twice in every micro-batch.

What is the motivation/reason for this?

Thanks,
Kineret