Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it? > On Apr 5, 2017, at 4:30 PM, Maciej Bryński wrote: > > Hi, > I'm trying to run queries with many values in IN operator. > > The result is that for more

Re: Handling questions in the mailing lists

2016-11-08 Thread Michael Segel
Guys… please take what I say with a grain of salt… The issue is that the input is a stream of messages where they are addressed in a LIFO manner. This means that messages may be ignored. The stream of data (user@spark for example) is semi-structured in that the stream contains a lot of

Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi, Apologies if I’ve asked this question before but I didn’t see it in the list and I’m certain that my last surviving brain cell has gone on strike over my attempt to reduce my caffeine intake… Posting this to both user and dev because I think the question / topic falls into both camps.

Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Segel
Silly question? When you talk about ‘user specified schema’ do you mean for the user to supply an additional schema, or that you’re using the schema that’s described by the JSON string? (or both? [either/or] ) Thx On Sep 28, 2016, at 12:52 PM, Michael Armbrust

Re: Spark Thrift Server Concurrency

2016-06-23 Thread Michael Segel
Hi, There are a lot of moving parts and a lot of unknowns in your description. Besides the version stuff: how many executors, how many cores? How much memory? Are you persisting (memory and disk) or just caching (memory)? During the execution… same tables… are you seeing a lot of

Re: Secondary Indexing?

2016-05-30 Thread Michael Segel
> http://talebzadehmich.wordpress.com > > > On 30 May 2016 at 17:08, Michael Segel <msegel_had...@hotmail.com> wrote:

Secondary Indexing?

2016-05-30 Thread Michael Segel
I’m not sure where to post this since it’s a bit of a philosophical question in terms of design and vision for Spark. If we look at SparkSQL and performance… where does secondary indexing fit in? The reason this is a bit awkward is that if you view Spark as querying RDDs which are temporary,

Indexing of RDDs and DF in 2.0?

2016-05-17 Thread Michael Segel
Hi, I saw a replay of a talk about what’s coming in Spark 2.0 and the speed performances… I am curious about indexing of data sets. In HBase/MapRDB you can create ordered sets of indexes through an inverted table. Here, you can take the intersection of the indexes to find the result set of

Re: Any documentation on Spark's security model beyond YARN?

2016-04-01 Thread Michael Segel
> On Wed, Mar 30, 2016 at 4:33 AM, Steve Loughran <ste...@hortonworks.com> >> wrote: >>> >>>> On 29 Mar 2016, at 22:19, Michael Segel <msegel_had...@hotmail.com> wrote: >>>> >>>> Hi, >>>> >>>> So yeah, I kn

Any documentation on Spark's security model beyond YARN?

2016-03-29 Thread Michael Segel
Hi, So yeah, I know that Spark jobs running on a Hadoop cluster will inherit their security from the underlying YARN job. However… that’s not really saying much when you think about some use cases. Like using the Thrift service … I’m wondering what else is new and what people have been

Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Hi, I’m looking at the online docs for building Spark 1.4.1 … http://spark.apache.org/docs/latest/building-spark.html I was interested in building Spark for Scala 2.11 (latest Scala) and also for Hive and JDBC support. The docs say: