Re: Implementing Upsert logic Through Streaming

2019-06-30 Thread Sachit Murarka
Hi Chris,

I have to make sure my DB has the latest value for any record at any given
point in time.
Say the following is the data; I have to take the 4th row for EmpId 2.
Also, if any Emp details are already in Oracle, I have to update them with
the latest value from the stream.

EmpId, salary, timestamp
1, 1000, 1234
2, 2000, 2234
3, 2000, 3234
2, 2100, 4234
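
To illustrate the requirement, here is a minimal sketch of what "latest row
per EmpId" means on the sample above. The column names follow the sample; the
session setup and types are made up for the example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Minimal sketch: keep only the row with the highest timestamp per EmpId.
val spark = SparkSession.builder.appName("latest-per-key").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 1000, 1234L),
  (2, 2000, 2234L),
  (3, 2000, 3234L),
  (2, 2100, 4234L)
).toDF("EmpId", "salary", "timestamp")

val w = Window.partitionBy($"EmpId").orderBy($"timestamp".desc)
val latest = df
  .withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .drop("rn")

latest.show()   // EmpId 2 keeps only the row with salary 2100, timestamp 4234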

Thanks
Sachit

On Mon, 1 Jul 2019, 01:46 Chris Teoh,  wrote:

> Just thinking on this: if your needs can be addressed with batch processing
> instead of streaming, that would be a viable solution. A lambda architecture
> approach also seems like a possibility.
>
> On Sun., 30 Jun. 2019, 9:54 am Chris Teoh,  wrote:
>
>> Not sure what your needs are here.
>>
>> If you can afford to wait, increase your micro-batch window to a long period
>> of time, aggregate your data by key every micro-batch, and then apply those
>> changes to the Oracle database.
>>
>> Since you're streaming from text files, there's no way to pre-partition your
>> stream. If you were using Kafka, you could partition by record key and do the
>> summarisation that way before applying the changes to Oracle.
>>
>> I hope that helps.
>>
>> On Tue., 25 Jun. 2019, 9:43 pm Sachit Murarka, 
>> wrote:
>>
>>> Hi All,
>>>
>>> I will get records continuously in text file form (streaming). Each record
>>> will also have a timestamp field.
>>>
>>> The target is an Oracle database.
>>>
>>> My goal is to maintain the latest record for each key in Oracle. Could you
>>> please suggest how this can be implemented efficiently?
>>>
>>> Kind Regards,
>>> Sachit Murarka
>>>
>>


Re: Implementing Upsert logic Through Streaming

2019-06-30 Thread Chris Teoh
Just thinking on this: if your needs can be addressed with batch processing
instead of streaming, that would be a viable solution. A lambda architecture
approach also seems like a possibility.
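
If you stay with streaming, here is a rough sketch of what I had in mind with
aggregating each micro-batch by key and then applying the changes to Oracle
(quoted below). The file path, schema, JDBC URL, credentials, table layout and
MERGE statement are all assumptions, not taken from your setup:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

// Rough sketch only: paths, schema, JDBC details and table names are assumptions.
val spark = SparkSession.builder.appName("upsert-stream").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("EmpId", IntegerType),
  StructField("salary", IntegerType),
  StructField("timestamp", LongType)))

val stream = spark.readStream
  .schema(schema)
  .csv("/path/to/landing/dir")   // incoming text files, assumed comma-separated

val upsertBatch: (DataFrame, Long) => Unit = (batch, _) => {
  // keep only the latest row per EmpId within this micro-batch
  val w = Window.partitionBy($"EmpId").orderBy($"timestamp".desc)
  val latest = batch
    .withColumn("rn", row_number().over(w))
    .where($"rn" === 1)
    .drop("rn")

  // apply the surviving rows to Oracle as an upsert (MERGE)
  latest.rdd.foreachPartition { rows =>
    val conn = java.sql.DriverManager.getConnection(
      "jdbc:oracle:thin:@//dbhost:1521/service", "user", "password")
    val stmt = conn.prepareStatement(
      """MERGE INTO emp t
        |USING (SELECT ? AS empid, ? AS salary, ? AS ts FROM dual) s
        |ON (t.empid = s.empid)
        |WHEN MATCHED THEN UPDATE SET t.salary = s.salary, t.ts = s.ts
        |WHEN NOT MATCHED THEN INSERT (empid, salary, ts)
        |  VALUES (s.empid, s.salary, s.ts)""".stripMargin)
    rows.foreach { r =>
      stmt.setInt(1, r.getAs[Int]("EmpId"))
      stmt.setInt(2, r.getAs[Int]("salary"))
      stmt.setLong(3, r.getAs[Long]("timestamp"))
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }
}

val query = stream.writeStream
  .foreachBatch(upsertBatch)
  .trigger(Trigger.ProcessingTime("10 minutes"))   // a long micro-batch window
  .start()

query.awaitTermination()

The longer the trigger interval, the fewer MERGE round trips you make against
Oracle per key.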

On Sun., 30 Jun. 2019, 9:54 am Chris Teoh,  wrote:

> Not sure what your needs are here.
>
> If you can afford to wait, increase your micro-batch window to a long period
> of time, aggregate your data by key every micro-batch, and then apply those
> changes to the Oracle database.
>
> Since you're streaming from text files, there's no way to pre-partition your
> stream. If you were using Kafka, you could partition by record key and do the
> summarisation that way before applying the changes to Oracle.
>
> I hope that helps.
>
> On Tue., 25 Jun. 2019, 9:43 pm Sachit Murarka, 
> wrote:
>
>> Hi All,
>>
>> I will get records continuously in text file form (streaming). Each record
>> will also have a timestamp field.
>>
>> The target is an Oracle database.
>>
>> My goal is to maintain the latest record for each key in Oracle. Could you
>> please suggest how this can be implemented efficiently?
>>
>> Kind Regards,
>> Sachit Murarka
>>
>


k8s orchestrating Spark service

2019-06-30 Thread Pat Ferrel
We're trying to set up a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.

Spark, on the other hand, is proving difficult. We forked a container and
have tried to create our own chart, but we are having several problems with
this.

So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This is not a request to use k8s to orchestrate Spark jobs, but to orchestrate
the service cluster itself.

Thanks


Re: Map side join without broadcast

2019-06-30 Thread Rahul Nandi
You can implement a custom partitioner to do the bucketing.
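
As a rough sketch (key type, names, and bucket count are all made up), a custom
partitioner just maps a record's key to a bucket; if both RDDs are partitioned
with the same partitioner, the subsequent join is partition-local:

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Sketch of a hash-bucketing partitioner; the bucket count is illustrative.
class KeyBucketPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

object BucketedJoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bucketed-join").setMaster("local[*]"))
    val part = new KeyBucketPartitioner(8)

    // Both sides use the same partitioner, so matching keys are co-located
    // and the join does not re-shuffle either RDD.
    val a = sc.parallelize(Seq("k1" -> "indexA", "k2" -> "indexB")).partitionBy(part)
    val b = sc.parallelize(Seq("k1" -> 1, "k1" -> 2, "k2" -> 3)).partitionBy(part)

    a.join(b).collect().foreach(println)
    sc.stop()
  }
}

For DataFrames, writing both sides out with bucketBy, as Chris describes in the
quoted reply below, achieves a similar effect at the storage layer; a sketch of
that follows after the quoted message.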

On Sun, Jun 30, 2019 at 5:15 AM Chris Teoh  wrote:

> The closest thing I can think of here is to have both dataframes written out
> using buckets. Hive uses this technique for join optimisation, such that both
> datasets of the same bucket are read by the same mapper to achieve map-side
> joins.
>
> On Sat., 29 Jun. 2019, 9:10 pm jelmer,  wrote:
>
>> I have 2 dataframes:
>>
>> Dataframe A, which contains 1 element per partition that is gigabytes big
>> (an index).
>>
>> Dataframe B, which is made up of millions of small rows.
>>
>> I want to join B on A, but I want all the work to be done on the executors
>> holding the partitions of dataframe A.
>>
>> Is there a way to accomplish this without putting dataframe B in a
>> broadcast variable or doing a broadcast join?
>>
>>
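
For reference, here is a hedged illustration of the bucketed-write approach from
the quoted reply above; the table names, column names, and bucket count are all
made up:

import org.apache.spark.sql.SparkSession

// Both tables are written with the same bucket count on the join key, so a
// later join between them can avoid shuffling either side.
val spark = SparkSession.builder.appName("bucketed-tables").master("local[*]").getOrCreate()
import spark.implicits._

// disable broadcast joins so the bucketed sort-merge plan is visible
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

val dfA = Seq(("k1", "indexA"), ("k2", "indexB")).toDF("key", "index")
val dfB = Seq(("k1", 1), ("k1", 2), ("k2", 3)).toDF("key", "value")

dfA.write.bucketBy(64, "key").sortBy("key").saveAsTable("a_bucketed")
dfB.write.bucketBy(64, "key").sortBy("key").saveAsTable("b_bucketed")

spark.table("a_bucketed")
  .join(spark.table("b_bucketed"), "key")
  .explain()   // with matching buckets, the plan shows no exchange on either side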


Re: Map side join without broadcast

2019-06-30 Thread jelmer
Does something like the code below make any sense, or would there be a more
efficient way to do it?

> val wordsOnOnePartition = input
>   .map { word => Math.abs(word.id.hashCode) % numPartitions -> word }
>   .partitionBy(new PartitionIdPassthrough(numPartitions))
>
> val indices = wordsOnOnePartition
>   .mapPartitions(it => new IndexIterator(it, m))
>   .cache()
>
> val wordsOnEachPartition = input
>   .flatMap(word => 0 until numPartitions map { partition => partition -> word })
>   .partitionBy(new PartitionIdPassthrough(numPartitions))
>
> val nearest = indices.join(wordsOnEachPartition)
>   .flatMap { case (_, (index, Word(word, vector))) =>
>     index.findNearest(vector, k + 1)
>       .collect {
>         case SearchResult(Word(relatedWord, _), score) if relatedWord != word =>
>           RelatedItem(word, relatedWord, score)
>       }
>       .take(k)
>   }
>
> val result = nearest.groupBy(_.word).map { case (word, relatedItems) =>
>   word +: relatedItems.toSeq
>     .sortBy(_.similarity)(Ordering[Double].reverse)
>     .map(_.relatedWord)
>     .take(k)
>     .mkString("\t")
> }

I manually assign a partition to each word of a list of words and repartition
the RDD by this partition key.

Then I use mapPartitions to construct a partial index, so I end up with one
index in each partition.

Then I read the words again, but this time assign every partition to each
word and join it on the indices RDD by partition key. So effectively every
index will be queried.

Finally, I merge the results from each index into a single list, keeping
only the most relevant items, by doing a groupBy.
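
PartitionIdPassthrough is not shown in the snippet; presumably it is something
like the following, a partitioner that simply trusts the key to already be a
valid partition id (this is a guess at the actual implementation, not the real
code):

import org.apache.spark.Partitioner

// Guess at the helper used above: the key is already a partition id computed
// in the map step (hash modulo numPartitions), so just pass it through.
class PartitionIdPassthrough(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}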



On Sun, 30 Jun 2019 at 01:45, Chris Teoh  wrote:

> The closest thing I can think of here is to have both dataframes written out
> using buckets. Hive uses this technique for join optimisation, such that both
> datasets of the same bucket are read by the same mapper to achieve map-side
> joins.
>
> On Sat., 29 Jun. 2019, 9:10 pm jelmer,  wrote:
>
>> I have 2 dataframes:
>>
>> Dataframe A, which contains 1 element per partition that is gigabytes big
>> (an index).
>>
>> Dataframe B, which is made up of millions of small rows.
>>
>> I want to join B on A, but I want all the work to be done on the executors
>> holding the partitions of dataframe A.
>>
>> Is there a way to accomplish this without putting dataframe B in a
>> broadcast variable or doing a broadcast join?
>>
>>