Re: Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Yash Sharma
This answers exactly what you are looking for -

http://stackoverflow.com/a/34204640/1562474
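
For reference, a sketch of one common DataFrame approach (not necessarily what the
linked answer does; it assumes the df from the question below with id: String and
words: Array[String], and that collect_list/collect_set are available, e.g. via a
HiveContext on 1.x):

import org.apache.spark.sql.functions.{explode, collect_list, collect_set}

// Explode each array into one row per word, then re-aggregate per id.
val exploded = df.select(df("id"), explode(df("words")).as("word"))
val allWords      = exploded.groupBy("id").agg(collect_list("word").as("all_words"))
val distinctWords = exploded.groupBy("id").agg(collect_set("word").as("distinct_words"))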

On Tue, Jul 12, 2016 at 6:40 AM, Pedro Rodriguez 
wrote:

> Is it possible with Spark SQL to merge columns whose types are Arrays or
> Sets?
>
> My use case would be something like this:
>
> DF types
> id: String
> words: Array[String]
>
> I would want to do something like
>
> df.groupBy('id).agg(merge_arrays('words)) -> list of all words
> df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Yash Sharma
Spark is more of an execution engine than a database. Hive is a data
warehouse, though I still tend to treat it as an execution engine.

For databases, you could compare HBase and Cassandra, as they both have very
wide usage and proven performance. We have used Cassandra in the past and
were very happy with the results. You should move this discussion to the
Cassandra/HBase mailing lists for better advice.

Cheers

On Tue, Jul 12, 2016 at 3:23 PM, ayan guha  wrote:

> HI
>
> HBase is pretty neat itself. But speed is not the criterion for choosing HBase
> over Cassandra (or vice versa). Slowness can very well be because of design
> issues, and unfortunately changing technology will not help in that case
> :)
>
> I would suggest you quantify the "slow"-ness in conjunction
> with the infrastructure you have, and I am sure good people here will help.
>
> Best
> Ayan
>
> On Tue, Jul 12, 2016 at 3:01 PM, Ashok Kumar wrote:
>
>> Anyone in Spark as well?
>>
>> My colleague has been using Cassandra. However, he says it is too slow
>> and not user friendly.
>> MongoDB as a document database is pretty neat but not fast enough.
>>
>> My main concern is fast writes per second and good scaling.
>>
>>
>> Hive on Spark or Tez?
>>
>> How about HBase, or anything else?
>>
>> Any expert advice warmly acknowledged.
>>
>> thanking you
>>
>>
>> On Monday, 11 July 2016, 17:24, Ashok Kumar  wrote:
>>
>>
>> Hi Gurus,
>>
>> Advice appreciated from Hive gurus.
>>
>> My colleague has been using Cassandra. However, he says it is too slow
>> and not user friendly.
>> MongoDB as a document database is pretty neat but not fast enough.
>>
>> My main concern is fast writes per second and good scaling.
>>
>>
>> Hive on Spark or Tez?
>>
>> How about HBase, or anything else?
>>
>> Any expert advice warmly acknowledged.
>>
>> thanking you
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark cluster tuning recommendation

2016-07-11 Thread Yash Sharma
I would say use dynamic allocation rather than a fixed number of executors.
Provide whatever executor memory you would like.
Deciding on the values requires a couple of test runs and checking what works
best for you.

You could try something like -

--driver-memory 1G \
--executor-memory 2G \
--executor-cores 2 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=8 \
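
One assumption worth checking for your setup: dynamic allocation relies on the
external shuffle service, so --conf spark.shuffle.service.enabled=true is usually
needed alongside the flags above.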



On Tue, Jul 12, 2016 at 1:27 PM, Anuj Kumar  wrote:

> That configuration looks bad, with only two cores in use and 1 GB used by
> the app. A few points-
>
> 1. Please oversubscribe those CPUs to at least twice the number of cores
> you have to start with, and then tune if it freezes
> 2. Allocate all of the CPU cores and memory to your running app (I assume
> it is your test environment)
> 3. Assuming that you are running quad-core machines, if you define cores
> as 8 for your workers you will get 56 cores (CPU threads) across the 7 workers
> 4. Also, it depends on the source from where you are reading the data. If
> you are reading from HDFS, what is your block size and part count?
> 5. You may also have to tune the timeouts and frame-size based on the
> dataset and errors that you are facing
>
> We have run terasort with a couple of high-end worker machines reading/writing
> from HDFS, with 5-10 mount points allocated for HDFS and Spark local storage.
> We have used multiple configurations, like
> 10 workers with 10 CPUs/10 GB each, or 25 workers with 6 CPUs/6 GB each, running
> on each of the two machines with HDFS 512 MB blocks and 1000-2000 parts. All of
> this chatting over 10 GbE worked well.
>
> On Tue, Jul 12, 2016 at 3:39 AM, Kartik Mathur 
> wrote:
>
>> I am trying to run terasort in Spark, on a 7-node cluster with only 10 GB
>> of data, and executors get lost with a GC overhead limit exceeded error.
>>
>> This is what my cluster looks like -
>>
>>
>>- *Alive Workers:* 7
>>- *Cores in use:* 28 Total, 2 Used
>>- *Memory in use:* 56.0 GB Total, 1024.0 MB Used
>>- *Applications:* 1 Running, 6 Completed
>>- *Drivers:* 0 Running, 0 Completed
>>- *Status:* ALIVE
>>
>> Each worker has 8 cores and 4GB memory.
>>
>> My question is how do people running in production decide these
>> properties -
>>
>> 1) --num-executors
>> 2) --executor-cores
>> 3) --executor-memory
>> 4) num of partitions
>> 5) spark.default.parallelism
>>
>> Thanks,
>> Kartik
>>
>>
>>
>


Fwd: Fast database with writes per second and horizontal scaling

2016-07-11 Thread ayan guha
HI

HBase is pretty neat itself. But speed is not the criterion for choosing HBase
over Cassandra (or vice versa). Slowness can very well be because of design
issues, and unfortunately changing technology will not help in that case
:)

I would suggest you quantify the "slow"-ness in conjunction
with the infrastructure you have, and I am sure good people here will help.

Best
Ayan

On Tue, Jul 12, 2016 at 3:01 PM, Ashok Kumar 
wrote:

> Anyone in Spark as well?
>
> My colleague has been using Cassandra. However, he says it is too slow
> and not user friendly.
> MongoDB as a document database is pretty neat but not fast enough.
>
> My main concern is fast writes per second and good scaling.
>
>
> Hive on Spark or Tez?
>
> How about HBase, or anything else?
>
> Any expert advice warmly acknowledged.
>
> thanking you
>
>
> On Monday, 11 July 2016, 17:24, Ashok Kumar  wrote:
>
>
> Hi Gurus,
>
> Advice appreciated from Hive gurus.
>
> My colleague has been using Cassandra. However, he says it is too slow
> and not user friendly.
> MongoDB as a document database is pretty neat but not fast enough.
>
> My main concern is fast writes per second and good scaling.
>
>
> Hive on Spark or Tez?
>
> How about HBase, or anything else?
>
> Any expert advice warmly acknowledged.
>
> thanking you
>
>
>


-- 
Best Regards,
Ayan Guha


Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Ashok Kumar
Anyone in Spark as well?
My colleague has been using Cassandra. However, he says it is too slow and not
user friendly. MongoDB as a document database is pretty neat but not fast enough.
My main concern is fast writes per second and good scaling.

Hive on Spark or Tez?
How about HBase, or anything else?
Any expert advice warmly acknowledged.
thanking you

On Monday, 11 July 2016, 17:24, Ashok Kumar  wrote:
 

 Hi Gurus,
Advice appreciated from Hive gurus.
My colleague has been using Cassandra. However, he says it is too slow and not
user friendly. MongoDB as a document database is pretty neat but not fast enough.
My main concern is fast writes per second and good scaling.

Hive on Spark or Tez?
How about HBase, or anything else?
Any expert advice warmly acknowledged.
thanking you

  

Re: Spark cluster tuning recommendation

2016-07-11 Thread Anuj Kumar
That configuration looks bad, with only two cores in use and 1 GB used by
the app. A few points-

1. Please oversubscribe those CPUs to at least twice the number of cores
you have to start with, and then tune if it freezes
2. Allocate all of the CPU cores and memory to your running app (I assume
it is your test environment)
3. Assuming that you are running quad-core machines, if you define cores as
8 for your workers you will get 56 cores (CPU threads) across the 7 workers
4. Also, it depends on the source from where you are reading the data. If
you are reading from HDFS, what is your block size and part count?
5. You may also have to tune the timeouts and frame-size based on the
dataset and errors that you are facing

We have run terasort with a couple of high-end worker machines reading/writing
from HDFS, with 5-10 mount points allocated for HDFS and Spark local storage. We
have used multiple configurations, like
10 workers with 10 CPUs/10 GB each, or 25 workers with 6 CPUs/6 GB each, running
on each of the two machines with HDFS 512 MB blocks and 1000-2000 parts. All of
this chatting over 10 GbE worked well.
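
To make that concrete, here is a rough sketch (hypothetical numbers for 7 workers
with 8 cores / 4 GB each; treat it as a starting point to tune, not a recipe):

import org.apache.spark.{SparkConf, SparkContext}

// Two executors per worker, leaving headroom for the OS and daemons.
val conf = new SparkConf()
  .setAppName("terasort-tuning-sketch")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "1536m")
  .set("spark.default.parallelism", "112") // roughly 2-4 tasks per core in use
val sc = new SparkContext(conf)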

On Tue, Jul 12, 2016 at 3:39 AM, Kartik Mathur  wrote:

> I am trying to run terasort in Spark, on a 7-node cluster with only 10 GB
> of data, and executors get lost with a GC overhead limit exceeded error.
>
> This is what my cluster looks like -
>
>
>- *Alive Workers:* 7
>- *Cores in use:* 28 Total, 2 Used
>- *Memory in use:* 56.0 GB Total, 1024.0 MB Used
>- *Applications:* 1 Running, 6 Completed
>- *Drivers:* 0 Running, 0 Completed
>- *Status:* ALIVE
>
> Each worker has 8 cores and 4GB memory.
>
> My question is how do people running in production decide these
> properties -
>
> 1) --num-executors
> 2) --executor-cores
> 3) --executor-memory
> 4) num of partitions
> 5) spark.default.parallelism
>
> Thanks,
> Kartik
>
>
>


Complications with saving Kafka offsets?

2016-07-11 Thread BradleyUM
I'm working on a Spark Streaming (1.6.0) project and one of our requirements
is to persist Kafka offsets to Zookeeper after a batch has completed so that
we can restart work from the correct position if we have to restart the
process for any reason. Many links,
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
included, seem to suggest that calling transform() on the stream is a
perfectly acceptable way to store the offsets off for processing when the
batch completes. Since that method seems to offer more intuitive ordering
guarantees than foreachRDD() we have, up until now, preferred it. So our
code looks something like the following:

AtomicReference<OffsetRange[]> savedOffsets = new AtomicReference<>();

messages = messages.transformToPair((rdd) -> {
  // Save the offsets so that we can update ZK with them later
  HasOffsetRanges hasOffsetRanges = (HasOffsetRanges) rdd.rdd();
  savedOffsets.set(hasOffsetRanges.offsetRanges());
  return rdd; // transform must pass the RDD through unchanged
});

Unfortunately we've discovered that this doesn't work, as contrary to
expectations the logic inside of transformToPair() seems to run whenever a
new batch gets added, even if we're not prepared to process it yet. So
savedOffsets will store the offsets of the most recently enqueued batch, not
necessarily the one being processed. When a batch completes, then, the
offset we save to ZK may reflect enqueued data that we haven't actually
processed yet. This can (and has) created conditions where a crash causes us
to restart from the wrong position and drop data.

There seem to be two solutions to this, from what I can tell:

1.) A brief test using foreachRDD() instead of transform() seems to behave
more in line with expectations, with the call only being made when a batch
actually begins to process (a sketch follows below this list). I have yet to
find an explanation as to why the two methods differ in this way.
2.) Instead of using an AtomicReference we tried a queue of offsets. Our
logic pushes a set of offsets at the start of a batch and pulls off the
oldest at the end - the idea is that the one being pulled will always
reflect the batch most recently processed, not a newer one still in the queue.
Since we're not 100% sure whether Spark guarantees this, we also have logic to
assert that the batch that was completed has the same RDD ID as the one we're
pulling from the queue.
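
A minimal sketch of option 1 in Scala (our real code is Java; the stream name
directKafkaStream is hypothetical, and the cast is the documented HasOffsetRanges
pattern):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.foreachRDD { rdd =>
  // Runs when this batch is actually processed, so the ranges match the batch.
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // then persist offsetRanges to ZooKeeper once the batch's work has completed
}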

However, I have yet to find anything, on this list or elsewhere, that
suggests that either of these two approaches is necessary. Does what I've
described match anyone else's experience? Is the behavior I'm seeing from
the transform() method expected? Do both of the solutions I've proposed seem
legitimate, or is there some complication that I've failed to account for?

Any help is appreciated.

- Bradley



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Complications-with-saving-Kafka-offsets-tp27324.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread ayan guha
Hi Mich

Thanks for showing examples, makes perfect sense.

One question: "...I agree that on VLT (very large tables), the limitation
in available memory may be the overriding factor in using Spark"... have you
observed any specific threshold for VLT which tilts the balance against
Spark? For example, if I have a 10-node cluster with (say) 64G RAM and
8 CPUs per node, where should I expect Spark to crumble? What if my nodes have
128G RAM?

I know it's difficult to answer with empirical values and YMMV depending
on cluster load, data format, query, etc. But is there a guesstimate around?

Best
Ayan

On Tue, Jul 12, 2016 at 9:22 AM, Mich Talebzadeh 
wrote:

> Another point with Hive on Spark and Hive on Tez + LLAP; I am thinking out
> loud :)
>
>
>    1. I am using Hive on Spark and I have a table of 10GB, say, with 100
>    users concurrently accessing the same partition of an ORC table (last one
>    hour or so)
>    2. Spark takes data and puts it in memory. I gather only data for that
>    partition will be loaded for 100 users. In other words there will be 100
>    copies.
>    3. Spark, unlike an RDBMS, does not have the notion of a hot cache or Most
>    Recently Used (MRU) / Least Recently Used (LRU). So once the user finishes,
>    the data is released from Spark memory. The next user will load that data
>    again. Potentially this is somewhat wasteful of resources?
>    4. With Tez we only have a DAG. It is MR with a DAG. So the same algorithm
>    will be applied to 100 user sessions, but without the memory usage
>    5. If I add LLAP, will that be more efficient in terms of memory usage
>    compared to Hive or not? Will it keep the data in memory for reuse or not?
>    6. What I don't understand is what makes Tez and LLAP more efficient
>    compared to Spark!
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 July 2016 at 21:54, Mich Talebzadeh 
> wrote:
>
>> In my test I did like for like, keeping the setup the same, namely:
>>
>>
>>1. Table was a parquet table of 100 Million rows
>>2. The same set up was used for both Hive on Spark and Hive on MR
>>3. Spark was very impressive compared to MR on this particular test.
>>
>>
>> Just to see any issues, I created an ORC table in the image of the Parquet one
>> (insert/select from Parquet to ORC) with stats updated for columns etc.
>>
>> These were the results of the same run using ORC table this time:
>>
>> hive> select max(id) from oraclehadoop.dummy;
>>
>> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
>> Query Hive on Spark job[1] stages:
>> 2
>> 3
>> Status: Running (Hive on Spark job[1])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId:
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>> [StageCost]
>> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
>> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
>> Finished
>> Status: Finished successfully in 16.08 seconds
>> OK
>> 1
>> Time taken: 17.775 seconds, Fetched: 1 row(s)
>>
>> Repeat with MR engine
>>
>> hive> set hive.execution.engine=mr;
>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
>> versions. Consider using a different execution engine (i.e. spark, tez) or
>> using Hive 1.X releases.
>>
>> hive> select max(id) from oraclehadoop.dummy;
>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
>> the future versions. Consider using a different execution engine (i.e.
>> spark, tez) or using Hive 1.X releases.
>> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
>> Total jobs = 1
>> Launching Job 1 out of 1
>> Number of reduce tasks determined at compile time: 1
>> In order to change the average load 

Re: Batch details are missing

2016-07-11 Thread C. Josephson
The solution ended up being upgrading from Spark 1.5 to Spark 1.6.1+

On Fri, Jun 24, 2016 at 2:57 PM, C. Josephson  wrote:

> We're trying to resolve some performance issues with spark streaming using
> the application UI, but the batch details page doesn't seem to be working.
> When I click on a batch in the streaming application UI, I expect to see
> something like this: http://i.stack.imgur.com/ApF8z.png
>
> But instead we see this:
> [image: Inline image 1]
>
> Any ideas why we aren't getting any job details? We are running pySpark
> 1.5.0.
>
> Thanks,
> -cjoseph
>



-- 
Colleen Josephson
Engineering Researcher
Uhana, Inc.


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Another point with Hive on Spark and Hive on Tez + LLAP; I am thinking out loud
:)


   1. I am using Hive on Spark and I have a table of 10GB, say, with 100
   users concurrently accessing the same partition of an ORC table (last one
   hour or so)
   2. Spark takes data and puts it in memory. I gather only data for that
   partition will be loaded for 100 users. In other words there will be 100
   copies.
   3. Spark, unlike an RDBMS, does not have the notion of a hot cache or Most
   Recently Used (MRU) / Least Recently Used (LRU). So once the user finishes,
   the data is released from Spark memory. The next user will load that data
   again. Potentially this is somewhat wasteful of resources?
   4. With Tez we only have a DAG. It is MR with a DAG. So the same algorithm
   will be applied to 100 user sessions, but without the memory usage
   5. If I add LLAP, will that be more efficient in terms of memory usage
   compared to Hive or not? Will it keep the data in memory for reuse or not?
   6. What I don't understand is what makes Tez and LLAP more efficient
   compared to Spark!
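
As a side note on point 3: within a single Spark application, data can be pinned
explicitly and reused across queries (a sketch, assuming a HiveContext/SQLContext
named sqlContext; whether Hive on Spark shares this across separate user sessions
is exactly the open question):

// Pin the table in memory once, then later queries read the cached copy.
sqlContext.sql("CACHE TABLE oraclehadoop.dummy")
val mx = sqlContext.sql("SELECT max(id) FROM oraclehadoop.dummy")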

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 21:54, Mich Talebzadeh  wrote:

> In my test I did like for like, keeping the setup the same, namely:
>
>
>1. Table was a parquet table of 100 Million rows
>2. The same set up was used for both Hive on Spark and Hive on MR
>3. Spark was very impressive compared to MR on this particular test.
>
>
> Just to see any issues, I created an ORC table in the image of the Parquet one
> (insert/select from Parquet to ORC) with stats updated for columns etc.
>
> These were the results of the same run using ORC table this time:
>
> hive> select max(id) from oraclehadoop.dummy;
>
> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 16.08 seconds
> OK
> 1
> Time taken: 17.775 seconds, Fetched: 1 row(s)
>
> Repeat with MR engine
>
> hive> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
> versions. Consider using a different execution engine (i.e. spark, tez) or
> using Hive 1.X releases.
>
> hive> select max(id) from oraclehadoop.dummy;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
> the future versions. Consider using a different execution engine (i.e.
> spark, tez) or using Hive 1.X releases.
> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> Starting Job = job_1468226887011_0008, Tracking URL =
> http://rhes564:8088/proxy/application_1468226887011_0008/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1468226887011_0008
> Hadoop job information for Stage-1: number of mappers: 23; number of
> reducers: 1
> 2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
> 2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
> 16.48 sec
> 2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
> 40.63 sec
> 2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, 

chisqSelector in Python

2016-07-11 Thread Tobi Bosede
Hi all,

There is no Python example for ChiSqSelector at the link below.
https://spark.apache.org/docs/1.4.1/mllib-feature-extraction.html#chisqselector

So I am converting the Scala code to Python. I "translated" the following
code

val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 }))
}

as:
discretizedData = data.map(lambda lp: LabeledPoint(lp.label,
    Vectors.dense(np.array(lp.features).map(lambda x: x / 16))))

However when I call selector.fit(discretizedData) I get this error. Any
thoughts on the problem? Thanks.

Py4JJavaError: An error occurred while calling o2184.fitChiSqSelector.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 158.0 failed 4 times, most recent failure: Lost task
0.3 in stage 158.0 (TID 3078, node032.hadoop.cls04):
java.net.SocketException: Connection reset


Re: Error starting thrift server on Spark

2016-07-11 Thread Jacek Laskowski
Create the directory and start over. You've got history server enabled.
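(For context: the FileNotFoundException below comes from event logging. With
spark.eventLog.enabled=true, the EventLoggingListener writes to spark.eventLog.dir,
which defaults to file:/tmp/spark-events, so that directory has to exist or event
logging has to be disabled.)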

Jacek
On 11 Jul 2016 11:07 p.m., "Marco Colombo" 
wrote:

Hi all, I cannot start thrift server on spark 1.6.2
I've configured binding port and IP and left default metastore.
In logs I get:

16/07/11 22:51:46 INFO NettyBlockTransferService: Server created on 46717
16/07/11 22:51:46 INFO BlockManagerMaster: Trying to register BlockManager
16/07/11 22:51:46 INFO BlockManagerMasterEndpoint: Registering block
manager 10.0.2.15:46717 with 511.1 MB RAM, BlockManagerId(driver,
10.0.2.15, 46717)
16/07/11 22:51:46 INFO BlockManagerMaster: Registered BlockManager
16/07/11 22:51:46 INFO AppClient$ClientEndpoint: Executor updated:
app-20160711225146-/0 is now RUNNING
16/07/11 22:51:47 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:76)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/07/11 22:51:47 INFO SparkUI: Stopped Spark web UI at http://localdev:4040

Did anyone found a similar issue? Any suggestion on the root cause?

Thanks to all!


QuantileDiscretizer not working properly with big dataframes

2016-07-11 Thread Pasquinell Urbani
Hi all,

We have a dataframe with 2.5 million records and 13 features. We want
to perform a logistic regression with this data, but first we need to divide
each column into discrete values using QuantileDiscretizer. This will
improve the performance of the model by reducing the impact of outliers.

For small dataframes QuantileDiscretizer works perfectly (see the ml example:
https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
but for large data frames it tends to split the column into only the values 0
and 1 (despite the number of buckets being set to 5). Here is my
code:

val discretizer = new QuantileDiscretizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setNumBuckets(5)

val result = discretizer.fit(df3).transform(df3)
result.show()

I found the same problem presented here:
https://issues.apache.org/jira/browse/SPARK-13444 . But there is no
solution yet.

Am I configuring the function in a bad way? Should I pre-process the
data (e.g. z-scores)? Can somebody help me deal with this?
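
A possible workaround sketch on 1.6 (an assumption, not a confirmed fix for
SPARK-13444): compute the split points over the full column and feed them to
Bucketizer, so the cuts do not depend on QuantileDiscretizer's sampling. It
assumes C4 is a Double column and keeps the 5-bucket setup:

import org.apache.spark.ml.feature.Bucketizer

val sorted = df3.select("C4").rdd.map(_.getDouble(0)).sortBy(identity).zipWithIndex()
val n = sorted.count()
val probs = Array(0.2, 0.4, 0.6, 0.8)
val cutIdx = probs.map(p => (p * (n - 1)).toLong).toSet
val cuts = sorted.filter { case (_, i) => cutIdx.contains(i) }.map(_._1).collect().sorted
// Splits must be strictly increasing; duplicates collapse buckets for skewed data.
val splits = (Double.NegativeInfinity +: cuts :+ Double.PositiveInfinity).distinct

val bucketizer = new Bucketizer().setInputCol("C4").setOutputCol("C4_Q").setSplits(splits)
val result = bucketizer.transform(df3)
result.show()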

Regards


Re: Spark hangs at "Removed broadcast_*"

2016-07-11 Thread dhruve ashar
Hi,

Can you check the time when the job actually finished from the logs. The
logs provided are too short and do not reveal meaningful information.



On Mon, Jul 11, 2016 at 9:50 AM, velvetbaldmime  wrote:

> Spark 2.0.0-preview
>
> We've got an app that uses a fairly big broadcast variable. We run this on
> a
> big EC2 instance, so deployment is in client-mode. Broadcasted variable is
> a
> massive Map[String, Array[String]].
>
> At the end of saveAsTextFile, the output in the folder seems to be complete
> and correct (apart from .crc files still being there) BUT the spark-submit
> process is stuck on, seemingly, removing the broadcast variable. The stuck
> logs look like this: http://pastebin.com/wpTqvArY
>
> My last run lasted for 12 hours after doing saveAsTextFile - just
> sitting there. I did a jstack on driver process, most threads are parked:
> http://pastebin.com/E29JKVT7
>
> Full store: We used this code with Spark 1.5.0 and it worked, but then the
> data changed and something stopped fitting into Kryo's serialisation
> buffer.
> Increasing it didn't help, so I had to disable the KryoSerialiser. Tested
> it
> again - it hung. Switched to 2.0.0-preview - seems like the same issue.
>
> I'm not quite sure what's even going on given that there's almost no CPU
> activity and no output in the logs, yet the output is not finalised like it
> used to before.
>
> Would appreciate any help, thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
-Dhruve Ashar


Spark cluster tuning recommendation

2016-07-11 Thread Kartik Mathur
I am trying to run terasort in Spark, on a 7-node cluster with only 10 GB of
data, and executors get lost with a GC overhead limit exceeded error.

This is what my cluster looks like -


   - *Alive Workers:* 7
   - *Cores in use:* 28 Total, 2 Used
   - *Memory in use:* 56.0 GB Total, 1024.0 MB Used
   - *Applications:* 1 Running, 6 Completed
   - *Drivers:* 0 Running, 0 Completed
   - *Status:* ALIVE

Each worker has 8 cores and 4GB memory.

My question is how do people running in production decide these properties
-

1) --num-executors
2) --executor-cores
3) --executor-memory
4) num of partitions
5) spark.default.parallelism

Thanks,
Kartik


/spark-ec2 script: trouble using ganglia web ui spark 1.6.1

2016-07-11 Thread Andy Davidson
I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 script.
The output shows Ganglia started, however I am not able to access
http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:5080/ganglia. I
have tried using the private IP from within my data center.



I do not see anything listening on port 5080.



Is there some additional step or configuration? (my AWS firewall knowledge
is limited)



Thanks



Andy





$ grep ganglia src/main/resources/scripts/launchCluster.sh.out

Initializing ganglia

[timing] ganglia init:  00h 00m 01s

Configuring /etc/ganglia/gmond.conf

Configuring /etc/ganglia/gmetad.conf

Configuring /etc/httpd/conf.d/ganglia.conf

Setting up ganglia

RSYNC'ing /etc/ganglia to slaves...

[timing] ganglia setup:  00h 00m 03s

Ganglia started at 
http://ec2-xxx.us-west-1.compute.amazonaws.com:5080/ganglia

$ 


bash-4.2# netstat -tulpn

Active Internet connections (only servers)

Proto Recv-Q Send-Q Local Address           Foreign Address    State    PID/Program name
tcp        0      0 0.0.0.0:8652            0.0.0.0:*          LISTEN   3832/gmetad
tcp        0      0 0.0.0.0:8787            0.0.0.0:*          LISTEN   2584/rserver
tcp        0      0 0.0.0.0:36757           0.0.0.0:*          LISTEN   2905/java
tcp        0      0 0.0.0.0:50070           0.0.0.0:*          LISTEN   2905/java
tcp        0      0 0.0.0.0:22              0.0.0.0:*          LISTEN   2144/sshd
tcp        0      0 127.0.0.1:631           0.0.0.0:*          LISTEN   2095/cupsd
tcp        0      0 127.0.0.1:7000          0.0.0.0:*          LISTEN   6512/python3.4
tcp        0      0 127.0.0.1:25            0.0.0.0:*          LISTEN   2183/sendmail
tcp        0      0 0.0.0.0:43813           0.0.0.0:*          LISTEN   3093/java
tcp        0      0 172.31.22.140:9000      0.0.0.0:*          LISTEN   2905/java
tcp        0      0 0.0.0.0:8649            0.0.0.0:*          LISTEN   3810/gmond
tcp        0      0 0.0.0.0:50090           0.0.0.0:*          LISTEN   3093/java
tcp        0      0 0.0.0.0:8651            0.0.0.0:*          LISTEN   3832/gmetad
tcp        0      0 :::8080                 :::*               LISTEN   23719/java
tcp        0      0 :::8081                 :::*               LISTEN   5588/java
tcp        0      0 :::172.31.22.140:6066   :::*               LISTEN   23719/java
tcp        0      0 :::172.31.22.140:6067   :::*               LISTEN   5588/java
tcp        0      0 :::22                   :::*               LISTEN   2144/sshd
tcp        0      0 ::1:631                 :::*               LISTEN   2095/cupsd
tcp        0      0 :::19998                :::*               LISTEN   3709/java
tcp        0      0 :::1                    :::*               LISTEN   3709/java
tcp        0      0 :::172.31.22.140:7077   :::*               LISTEN   23719/java
tcp        0      0 :::172.31.22.140:7078   :::*               LISTEN   5588/java
udp        0      0 0.0.0.0:8649            0.0.0.0:*                   3810/gmond
udp        0      0 0.0.0.0:631             0.0.0.0:*                   2095/cupsd
udp        0      0 0.0.0.0:38546           0.0.0.0:*                   2905/java
udp        0      0 0.0.0.0:68              0.0.0.0:*                   1142/dhclient
udp        0      0 172.31.22.140:123       0.0.0.0:*                   2168/ntpd
udp        0      0 127.0.0.1:123           0.0.0.0:*                   2168/ntpd
udp        0      0 0.0.0.0:123             0.0.0.0:*                   2168/ntpd




Re: Spark Streaming - Direct Approach

2016-07-11 Thread Tathagata Das
Aah, the docs have not been updated. They are totally in production in many
places. Others should chime in as well.

On Mon, Jul 11, 2016 at 1:43 PM, Mail.com  wrote:

> Hi All,
>
> Can someone please confirm if streaming direct approach for reading Kafka
> is still experimental or can it be used for production use.
>
> I see the documentation and talk from TD suggesting the advantages of the
> approach but docs state it is an "experimental" feature.
>
> Please suggest
>
> Thanks,
> Pradeep
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark Streaming - Direct Approach

2016-07-11 Thread Andy Davidson
Hi Pradeep

I cannot comment on experimental vs. production, however I recently
started a POC using the direct approach.

It's been running off and on for about 2 weeks. In general it seems to work
really well. One thing that is not clear to me is how the cursor is managed.
E.g. I have my topic set to delete after 1 hr. I have had some problems that
caused huge delays, i.e. jobs were running after the data was deleted. After
lots of jobs fail it seems to recover, i.e. the cursor advances to a valid
position.


I am running into problems, however I do not think they have to do with the
direct approach. I think they have to do with writing to S3.

Andy

From:  "Mail.com" 
Date:  Monday, July 11, 2016 at 1:43 PM
To:  "user @spark" 
Subject:  Spark Streaming - Direct Approach

> Hi All,
> 
> Can someone please confirm if streaming direct approach for reading Kafka is
> still experimental or can it be used for production use.
> 
> I see the documentation and talk from TD suggesting the advantages of the
> approach but docs state it is an "experimental" feature.
> 
> Please suggest
> 
> Thanks,
> Pradeep
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 




Error starting thrift server on Spark

2016-07-11 Thread Marco Colombo
Hi all, I cannot start thrift server on spark 1.6.2
I've configured binding port and IP and left default metastore.
In logs I get:

16/07/11 22:51:46 INFO NettyBlockTransferService: Server created on 46717
16/07/11 22:51:46 INFO BlockManagerMaster: Trying to register BlockManager
16/07/11 22:51:46 INFO BlockManagerMasterEndpoint: Registering block
manager 10.0.2.15:46717 with 511.1 MB RAM, BlockManagerId(driver,
10.0.2.15, 46717)
16/07/11 22:51:46 INFO BlockManagerMaster: Registered BlockManager
16/07/11 22:51:46 INFO AppClient$ClientEndpoint: Executor updated:
app-20160711225146-/0 is now RUNNING
16/07/11 22:51:47 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:76)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/07/11 22:51:47 INFO SparkUI: Stopped Spark web UI at http://localdev:4040

Did anyone found a similar issue? Any suggestion on the root cause?

Thanks to all!


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
In my test I did like for like, keeping the setup the same, namely:


   1. Table was a parquet table of 100 Million rows
   2. The same set up was used for both Hive on Spark and Hive on MR
   3. Spark was very impressive compared to MR on this particular test.


Just to see any issues, I created an ORC table in the image of the Parquet one
(insert/select from Parquet to ORC) with stats updated for columns etc.

These were the results of the same run using ORC table this time:

hive> select max(id) from oraclehadoop.dummy;

Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
Finished
Status: Finished successfully in 16.08 seconds
OK
1
Time taken: 17.775 seconds, Fetched: 1 row(s)

Repeat with MR engine

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1468226887011_0008, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0008/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0008
Hadoop job information for Stage-1: number of mappers: 23; number of
reducers: 1
2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
16.48 sec
2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
40.63 sec
2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
58.88 sec
2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
80.72 sec
2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU
103.43 sec
2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU
125.93 sec
2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU
147.17 sec
2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU
166.56 sec
2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU
189.29 sec
2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU
211.03 sec
2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU
235.68 sec
2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU
258.27 sec
2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU
278.44 sec
2016-07-11 21:38:49,206 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU
300.35 sec
2016-07-11 21:38:58,524 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU
322.89 sec
2016-07-11 21:39:07,889 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU
344.8 sec
2016-07-11 21:39:16,151 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU
367.77 sec
2016-07-11 21:39:25,456 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU
391.82 sec
2016-07-11 21:39:33,725 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
415.48 sec
2016-07-11 21:39:43,037 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU
436.09 sec
2016-07-11 21:39:51,292 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU
459.4 sec
2016-07-11 21:39:59,563 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
477.92 sec
2016-07-11 21:40:05,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
491.72 sec
2016-07-11 21:40:10,921 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
499.37 sec
MapReduce Total cumulative CPU time: 8 minutes 19 seconds 370 msec
Ended Job = 

Spark Streaming - Direct Approach

2016-07-11 Thread Mail.com
Hi All,

Can someone please confirm if streaming direct approach for reading Kafka is 
still experimental or can it be used for production use.

I see the documentation and talk from TD suggesting the advantages of the 
approach but docs state it is an "experimental" feature. 

Please suggest

Thanks,
Pradeep

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Processing ion formatted messages in spark

2016-07-11 Thread pandees waran
All,

Did anyone ever work on processing Ion-formatted messages in Spark? The Ion
format is a superset of JSON: all JSON documents are valid Ion, but the reverse
is not true.

For more details on Ion;
http://amznlabs.github.io/ion-docs/

Thanks.


Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Pedro Rodriguez
Is it possible with Spark SQL to merge columns whose types are Arrays or
Sets?

My use case would be something like this:

DF types
id: String
words: Array[String]

I would want to do something like

df.groupBy('id).agg(merge_arrays('words)) -> list of all words
df.groupBy('id).agg(merge_sets('words)) -> list of distinct words

Thanks,
-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: question about UDAF

2016-07-11 Thread Pedro Rodriguez
I am not sure I understand your code entirely, but I worked with UDAFs
Friday and over the weekend (
https://gist.github.com/EntilZha/3951769a011389fef25e930258c20a2a).

I think what is going on is that your "update" function is not defined
correctly. Update should take a possibly initialized or in progress buffer
and integrate new results in. Right now, you ignore the input row. What is
probably occurring is that the initialization value "" is setting the
buffer equal to the buffer itself which is "".

Merge is responsible for taking two buffers and merging them together. In
this case, the buffers are "" since initialize makes it "" and update keeps
it "" so the result is just "". I am not sure it matters, but you probably
also want to do buffer.getString(0).
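
For illustration, a sketch of a corrected version (the class name is arbitrary;
it assumes the goal is simply to concatenate the incoming Int ids into one String,
and a delimiter or an ArrayType buffer would usually be nicer in practice):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ConcatIds extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("id", IntegerType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("ids", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = "" }

  // Fold the incoming row into the buffer instead of overwriting it with itself.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getString(0) + input.getInt(0).toString
  }

  // Combine two partial buffers; both hold Strings, so no getAs[Int] here.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getString(0) + buffer2.getString(0)
  }

  def evaluate(buffer: Row): Any = buffer.getString(0)
}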

Pedro

On Mon, Jul 11, 2016 at 3:04 AM,  wrote:

> hello guys:
>  I have a DF and a UDAF. this DF has 2 columns, lp_location_id , id,
> both are of Int type. I want to group by id and aggregate all value of id
> into 1 string. So I used a UDAF to do this transformation: multi Int values
> to 1 String. However my UDAF returns empty values as the accessory attached.
>  Here is my code for my main class:
> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
> val hiveTable = hc.sql("select lp_location_id,id from
> house_id_pv_location_top50")
>
> val jsonArray = new JsonArray
> val result =
> hiveTable.groupBy("lp_location_id").agg(jsonArray(col("id")).as("jsonArray")).collect.foreach(println)
>
> --
>  Here is my code of my UDAF:
>
> class JsonArray extends UserDefinedAggregateFunction {
>   def inputSchema: org.apache.spark.sql.types.StructType =
> StructType(StructField("id", IntegerType) :: Nil)
>
>   def bufferSchema: StructType = StructType(
> StructField("id", StringType) :: Nil)
>
>   def dataType: DataType = StringType
>
>   def deterministic: Boolean = true
>
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = ""
>   }
>
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> buffer(0) = buffer.getAs[Int](0)
>   }
>
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> val s1 = buffer1.getAs[Int](0).toString()
> val s2 = buffer2.getAs[Int](0).toString()
> buffer1(0) = s1.concat(s2)
>   }
>
>   def evaluate(buffer: Row): Any = {
> buffer(0)
>   }
> }
>
>
> I don't quite understand why I get an empty result from my UDAF. I guess there
> may be 2 reasons:
> 1. incorrect initialization with "" in the initialize method
> 2. the buffer didn't get written to successfully.
>
> Can anyone share an idea about this? Thank you.
>
>
>
>
> 
>
> Thanks! Best regards!
> San.Luo
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


trouble accessing driver log files using rest-api

2016-07-11 Thread Andy Davidson
I am running spark-1.6.1 and the standalone cluster manager. I am running
into performance problems with Spark Streaming and added some extra metrics
to my log files. I submit my app in cluster mode (i.e. the driver runs on a
slave, not the master).


I am not able to get the driver log files while the app is running using the
documented rest api
 
http://spark.apache.org/docs/latest/monitoring.html#rest-api

I think the issue is that the rest-api gives you access to the app log files; I
need the driver log file.


$ curl  http://$host/api/v1/applications/

[ {

  "id" : "app-20160711185337-0049",

  "name" : "gnip1",

  "attempts" : [ {

"startTime" : "2016-07-11T18:53:35.318GMT",

"endTime" : "1969-12-31T23:59:59.999GMT",

"sparkUser" : "",

"completed" : false

  } ]

} ][ec2-user@ip-172-31-22-140 tmp]$



$ curl -o$outputFile http://$host/api/v1/applications/$appID/logs



$outputFile will always be an empty zip file



If I use executors/, I get info about the driver and executors, however there is
no way to 'get' the log files. The driver output does not have any executorLogs,
and the workers' executorLogs are versions of the log files rendered in HTML,
not the actual log files.




$ curl http://$host/api/v1/applications/$appID/executors [ { "id" :
"driver", "hostPort" : "172.31.23.203:33303", "rddBlocks" : 0, "memoryUsed"
: 0, "diskUsed" : 0, "activeTasks" : 0, "failedTasks" : 0, "completedTasks"
: 0, "totalTasks" : 0, "totalDuration" : 0, "totalInputBytes" : 0,
"totalShuffleRead" : 0, "totalShuffleWrite" : 0, "maxMemory" : 535953408,
"executorLogs" : { } }, { "id" : "1", "hostPort" :
"ip-172-31-23-200.us-west-1.compute.internal:51560", "rddBlocks" : 218,
"memoryUsed" : 452224280, "diskUsed" : 0, "activeTasks" : 1, "failedTasks" :
0, "completedTasks" : 27756, "totalTasks" : 27757, "totalDuration" :
1650935, "totalInputBytes" : 9619224986, "totalShuffleRead" : 0,
"totalShuffleWrite" : 507615, "maxMemory" : 535953408, "executorLogs" : {
"stdout" : 
"http://ec2-xxx.compute.amazonaws.com:8081/logPage/?appId=app-20160711185337
-0049=1=stdout", "stderr" :
"http://ec2-xxx.us-west-1.compute.amazonaws.com:8081/logPage/?appId=app-2016
0711185337-0049=1=stderr" }

Any suggestions would be greatly appreciated

Andy





Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
Thanks Michael!

But what about when I am not trying to save as parquet? No way around the
error using saveAsTable()? I am using Spark 1.4.

Tobi
On Jul 11, 2016 2:10 PM, "Michael Armbrust"  wrote:

> This is protecting you from a limitation in parquet.  The library will let
> you write out invalid files that can't be read back, so we added this check.
>
> You can call .format("csv") (in spark 2.0) to switch it to CSV.
>
> On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede  wrote:
>
>> Hi everyone,
>>
>> I am trying to save a data frame with special characters in the column
>> names as a table in hive. However I am getting the following error. Is the
>> only solution to rename all the columns? Or is there some argument that can
>> be passed into the saveAsTable() or write.parquet() functions to ignore
>> special characters?
>>
>> Py4JJavaError: An error occurred while calling o2956.saveAsTable.
>> : org.apache.spark.sql.AnalysisException: Attribute name "apple- 
>> mail_duration" contains invalid character(s) among " ,;{}()\n\t=". Please 
>> use alias to rename it.
>>
>>
>> If not how can I simply write the data frame as a csv file?
>>
>> Thanks,
>> Tobi
>>
>>
>>
>


Re: Custom Spark Error on Hadoop Cluster

2016-07-11 Thread Xiangrui Meng
(+user@spark. Please copy user@ so other people could see and help.)

The error message means you have an MLlib jar on the classpath but it
didn't contain ALS$StandardNNLSSolver. So either the modified jar was not
deployed to the workers, or an unmodified MLlib jar is sitting
in front of the modified one on the classpath. You can check the worker
logs and see the classpath used in launching the worker, and then check the
MLlib jars on that classpath. -Xiangrui

On Sun, Jul 10, 2016 at 10:18 PM Alger Remirata 
wrote:

> Hi Xiangrui,
>
> We have the modified jars deployed both on master and slave nodes.
>
> What do you mean by this line?: 1. The unmodified Spark jars were not on
> the classpath (already existed on the cluster or pulled in by other
> packages).
>
> How would I check that the unmodified Spark jars are not on the classpath?
> We change entirely the contents of the directory for SPARK_HOME. The newly
> built customized spark is the new contents of the current SPARK_HOME we
> have right now.
>
> Thanks,
>
> Alger
>
> On Fri, Jul 8, 2016 at 1:32 PM, Xiangrui Meng  wrote:
>
>> This seems like a deployment or dependency issue. Please check the
>> following:
>> 1. The unmodified Spark jars were not on the classpath (already existed
>> on the cluster or pulled in by other packages).
>> 2. The modified jars were indeed deployed to both master and slave nodes.
>>
>> On Tue, Jul 5, 2016 at 12:29 PM Alger Remirata 
>> wrote:
>>
>>> Hi all,
>>>
>>> First of all, we like to thank you for developing spark. This helps us a
>>> lot on our data science task.
>>>
>>> I have a question. We have build a customized spark using the following
>>> command:
>>> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
>>> -Phive-thriftserver -DskipTests clean package.
>>>
>>> On the custom spark we built, we've added a new scala file or package
>>> called StandardNNLS file however it got an error saying:
>>>
>>> Name: org.apache.spark.SparkException
>>> Message: Job aborted due to stage failure: Task 21 in stage 34.0 failed
>>> 4 times, most recent failure: Lost task 21.3 in stage 34.0 (TID 2547,
>>> 192.168.60.115): java.lang.ClassNotFoundException:
>>> org.apache.spark.ml.recommendation.ALS$StandardNNLSSolver
>>>
>>> StandardNNLSolver is found on another scala file called
>>> StandardNNLS.scala
>>> as we replace the original NNLS solver from scala with StandardNNLS
>>> Do you guys have some idea about the error. Is there a config file we
>>> need to edit to add the classpath? Even if we insert the added codes in
>>> ALS.scala and not create another file like StandardNNLS.scala, the inserted
>>> code is not recognized. It still gets an error regarding
>>> ClassNotFoundException
>>>
>>> However, when we run this on our local machine and not on the hadoop
>>> cluster, it is working. We don't know if the error is because we are using
>>> mvn to build custom spark or it has something to do with communicating to
>>> hadoop cluster.
>>>
>>> We would like to ask some ideas from you how to solve this problem. We
>>> can actually create another package not dependent to Apache Spark but this
>>> is so slow. As of now, we are still learning scala and spark. Using Apache
>>> spark utilities make the code faster. However, if we'll make another
>>> package not dependent to apache spark, we have to recode the utilities that
>>> are set private in Apache Spark. So, it is better to use Apache Spark and
>>> insert some code that we can use.
>>>
>>> Thanks,
>>>
>>> Alger
>>>
>>
>


Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Michael Armbrust
This is protecting you from a limitation in parquet.  The library will let
you write out invalid files that can't be read back, so we added this check.

You can call .format("csv") (in spark 2.0) to switch it to CSV.
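
If staying on parquet/saveAsTable in 1.x, one workaround sketch (my own, not an
official option) is to alias every column to a parquet-safe name first; the
invalid characters are the " ,;{}()\n\t=" set from the error message, and the
table name below is a placeholder:

// Replace any invalid character in a column name with '_' before saving.
val invalid = " ,;{}()\n\t="
val safe = df.columns.foldLeft(df) { (d, c) =>
  val cleaned = c.map(ch => if (invalid.contains(ch)) '_' else ch)
  if (cleaned == c) d else d.withColumnRenamed(c, cleaned)
}
safe.write.saveAsTable("my_table")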

On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede  wrote:

> Hi everyone,
>
> I am trying to save a data frame with special characters in the column
> names as a table in hive. However I am getting the following error. Is the
> only solution to rename all the columns? Or is there some argument that can
> be passed into the saveAsTable() or write.parquet() functions to ignore
> special characters?
>
> Py4JJavaError: An error occurred while calling o2956.saveAsTable.
> : org.apache.spark.sql.AnalysisException: Attribute name "apple- 
> mail_duration" contains invalid character(s) among " ,;{}()\n\t=". Please use 
> alias to rename it.
>
>
> If not how can I simply write the data frame as a csv file?
>
> Thanks,
> Tobi
>
>
>


Run Stored Procedures from Spark SqlContext

2016-07-11 Thread zachkirsch
Hi,

I have a SQL Server set up, and I also have a Spark cluster up and running
that is executing Scala programs. I can connect to the SQL Server and query
for data successfully. However, I need to call stored procedures from the
Scala/Spark code (stored procedures that exist in the database) and I can't
figure it out.

Can anyone help me out, or direct me to a forum that might be better suited
for this question? I can provide any more information that is helpful.
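
(For reference, a sketch with hypothetical names, and not a Spark-specific API:
a stored procedure can be invoked over plain JDBC from the driver, since the
Spark JDBC data source only reads tables/queries. It assumes the SQL Server JDBC
driver jar is on the classpath.)

import java.sql.DriverManager

val url = "jdbc:sqlserver://myhost:1433;databaseName=mydb;user=me;password=secret"
val conn = DriverManager.getConnection(url)
try {
  // JDBC escape syntax for calling a stored procedure with one parameter.
  val stmt = conn.prepareCall("{call dbo.my_stored_proc(?)}")
  stmt.setInt(1, 42) // hypothetical parameter
  stmt.execute()
} finally {
  conn.close()
}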

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Run-Stored-Procedures-from-Spark-SqlContext-tp27322.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
Hi everyone,

I am trying to save a data frame with special characters in the column
names as a table in hive. However I am getting the following error. Is the
only solution to rename all the columns? Or is there some argument that can
be passed into the saveAsTable() or write.parquet() functions to ignore
special characters?

Py4JJavaError: An error occurred while calling o2956.saveAsTable.
: org.apache.spark.sql.AnalysisException: Attribute name "apple-
mail_duration" contains invalid character(s) among " ,;{}()\n\t=".
Please use alias to rename it.


If not how can I simply write the data frame as a csv file?

Thanks,
Tobi


Re: Question on Spark shell

2016-07-11 Thread Sivakumaran S
That was my bad with the title. 

I am getting that output when I run my application, both from the IDE as well 
as in the console. 

I want the server logs themselves displayed in the terminal from where I start the
server. Right now, running the command ‘start-master.sh’ returns the prompt. I
want the Spark logs as events occur (INFO, WARN, ERROR); like enabling a debug
mode wherein server output is printed to the screen.

I have to edit the log4j properties file, that much I have learnt so far. 
Should be able to hack it now. Thanks for the help. Guess just helping to frame 
the question was enough to find the answer :)




> On 11-Jul-2016, at 6:57 PM, Anthony May  wrote:
> 
> I see. The title of your original email was "Spark Shell" which is a Spark 
> REPL environment based on the Scala Shell, hence why I misunderstood you.
> 
> You should have the same output starting the application on the console. You 
> are not seeing any output?
> 
On Mon, 11 Jul 2016 at 11:55 Sivakumaran S wrote:
> I am running a spark streaming application using Scala in the IntelliJ IDE. I 
> can see the Spark output in the IDE itself (aggregation and stuff). I want 
> the spark server logging (INFO, WARN, etc) to be displayed in screen when I 
> start the master in the console. For example, when I start a kafka cluster, 
> the prompt is not returned and the debug log is printed to the terminal. I 
> want that set up with my spark server. 
> 
> I hope that explains my retrograde requirement :)
> 
> 
> 
>> On 11-Jul-2016, at 6:49 PM, Anthony May wrote:
>> 
>> Starting the Spark Shell gives you a Spark Context to play with straight 
>> away. The output is printed to the console.
>> 
>> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S wrote:
>> Hello,
>> 
>> Is there a way to start the spark server with the log output piped to 
>> screen? I am currently running spark in the standalone mode on a single 
>> machine.
>> 
>> Regards,
>> 
>> Sivakumaran
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> 
>> 
> 



Re: Question on Spark shell

2016-07-11 Thread Anthony May
I see. The title of your original email was "Spark Shell" which is a Spark
REPL environment based on the Scala Shell, hence why I misunderstood you.

You should have the same output starting the application on the console.
You are not seeing any output?

On Mon, 11 Jul 2016 at 11:55 Sivakumaran S  wrote:

> I am running a spark streaming application using Scala in the IntelliJ
> IDE. I can see the Spark output in the IDE itself (aggregation and stuff).
> I want the spark server logging (INFO, WARN, etc) to be displayed in screen
> when I start the master in the console. For example, when I start a kafka
> cluster, the prompt is not returned and the debug log is printed to the
> terminal. I want that set up with my spark server.
>
> I hope that explains my retrograde requirement :)
>
>
>
> On 11-Jul-2016, at 6:49 PM, Anthony May  wrote:
>
> Starting the Spark Shell gives you a Spark Context to play with straight
> away. The output is printed to the console.
>
> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S  wrote:
>
>> Hello,
>>
>> Is there a way to start the spark server with the log output piped to
>> screen? I am currently running spark in the standalone mode on a single
>> machine.
>>
>> Regards,
>>
>> Sivakumaran
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Question on Spark shell

2016-07-11 Thread Sivakumaran S
I am running a spark streaming application using Scala in the IntelliJ IDE. I 
can see the Spark output in the IDE itself (aggregation and stuff). I want the 
spark server logging (INFO, WARN, etc) to be displayed in screen when I start 
the master in the console. For example, when I start a kafka cluster, the 
prompt is not returned and the debug log is printed to the terminal. I want 
that set up with my spark server. 

I hope that explains my retrograde requirement :)



> On 11-Jul-2016, at 6:49 PM, Anthony May  wrote:
> 
> Starting the Spark Shell gives you a Spark Context to play with straight 
> away. The output is printed to the console.
> 
> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S wrote:
> Hello,
> 
> Is there a way to start the spark server with the log output piped to screen? 
> I am currently running spark in the standalone mode on a single machine.
> 
> Regards,
> 
> Sivakumaran
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 



Re: Question on Spark shell

2016-07-11 Thread Anthony May
Starting the Spark Shell gives you a Spark Context to play with straight
away. The output is printed to the console.

On Mon, 11 Jul 2016 at 11:47 Sivakumaran S  wrote:

> Hello,
>
> Is there a way to start the spark server with the log output piped to
> screen? I am currently running spark in the standalone mode on a single
> machine.
>
> Regards,
>
> Sivakumaran
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Question on Spark shell

2016-07-11 Thread Sivakumaran S
Hello,

Is there a way to start the spark server with the log output piped to screen? I 
am currently running spark in the standalone mode on a single machine. 

Regards,

Sivakumaran


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Appreciate all the comments.

Hive on Spark. Spark runs as an execution engine and is only used when you
query Hive. Otherwise it is not running. I run it in Yarn client mode. let
me show you an example

In hive-site.xml set the execution engine to spark. It requires
some configuration but it does work :)

Alternatively log in to hive and do the setting there


set hive.execution.engine=spark;
set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
set spark.master=yarn-client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.cores=8;
set spark.ui.port=;

Small test ride

First using Hive 2 on Spark 1.3.1 to find max(id) for a 100million rows
parquet table

hive> select max(id) from oraclehadoop.dummy_parquet;

Starting Spark Job = a7752b2b-d73a-45de-aced-ddf02810938d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 17:41:52,386 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-11 17:41:55,409 Stage-2_0: 1(+8)/24 Stage-3_0: 0/1
2016-07-11 17:41:56,420 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-11 17:41:58,434 Stage-2_0: 10(+2)/24Stage-3_0: 0/1
2016-07-11 17:41:59,440 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-11 17:42:01,455 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-11 17:42:02,462 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-11 17:42:04,476 Stage-2_0: 23(+1)/24Stage-3_0: 0/1
2016-07-11 17:42:05,483 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
Finished

Status: Finished successfully in 14.12 seconds
OK
1
Time taken: 14.38 seconds, Fetched: 1 row(s)

--simply switch the engine in hive to MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy_parquet;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Starting Job = job_1468226887011_0005, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0005/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0005
Hadoop job information for Stage-1: number of mappers: 24; number of
reducers: 1
2016-07-11 17:42:46,904 Stage-1 map = 0%,  reduce = 0%
2016-07-11 17:42:56,328 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
31.76 sec
2016-07-11 17:43:05,676 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU
61.78 sec
2016-07-11 17:43:16,091 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
95.44 sec
2016-07-11 17:43:24,419 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
121.6 sec
2016-07-11 17:43:32,734 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU
149.37 sec
2016-07-11 17:43:41,031 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU
177.62 sec
2016-07-11 17:43:48,305 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU
204.92 sec
2016-07-11 17:43:56,580 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU
235.34 sec
2016-07-11 17:44:05,917 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU
262.18 sec
2016-07-11 17:44:14,222 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU
286.21 sec
2016-07-11 17:44:22,502 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU
310.34 sec
2016-07-11 17:44:32,923 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
346.26 sec
2016-07-11 17:44:43,301 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU
379.11 sec
2016-07-11 17:44:53,674 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU
417.9 sec
2016-07-11 17:45:04,001 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU
450.73 sec
2016-07-11 17:45:13,327 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU
476.7 sec
2016-07-11 17:45:22,656 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU
508.97 sec
2016-07-11 17:45:33,002 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU
535.69 sec
2016-07-11 17:45:43,355 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU
573.33 sec
2016-07-11 17:45:52,613 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
605.01 sec
2016-07-11 17:46:02,962 Stage-1 map = 88%,  reduce = 0%, Cumulative CPU
632.38 sec
2016-07-11 17:46:13,316 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU
666.45 sec
2016-07-11 17:46:23,656 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
693.72 sec
2016-07-11 17:46:31,919 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
714.71 sec
2016-07-11 17:46:36,060 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
721.83 sec
MapReduce Total cumulative CPU time: 12 minutes 1 seconds 830 msec
Ended Job = job_1468226887011_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 24  Reduce: 1   Cumulative CPU: 721.83 sec   HDFS Read:
400442823 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 12 minutes 1 seconds 830 msec
OK
1
Time taken: 239.532 seconds, Fetched: 1 row(s)


I leave it 

Using accumulators in Local mode for testing

2016-07-11 Thread harelglik
Hi,

I am writing an app in Spark ( 1.6.1 ) in which I am using an accumulator.
My accumulator is simply counting rows: acc += 1.
My test processes 4 files each with 4 rows however the value of the
accumulator in the end is not 16 and even worse is inconsistent between
runs.

Are accumulators not to be used in LocalMode?, Is it a known issue?

Many thanks,
Harel.
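
For what it's worth, accumulators do work in local mode. The usual cause of inconsistent totals is incrementing the accumulator inside a transformation (map, filter, ...): Spark only guarantees that each task's updates are applied exactly once when the updates happen inside an action, and transformation-side updates can be applied more than once if tasks or stages are re-executed (and an action like take() may not compute every partition at all). A minimal sketch of the safe pattern (the input path is hypothetical):

val acc = sc.accumulator(0L, "rows")   // Spark 1.6 accumulator API
sc.textFile("/data/in/*.txt")          // hypothetical input path
  .foreach(_ => acc += 1L)             // foreach is an action, so each row is counted once
println(s"rows seen: ${acc.value}")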



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-accumulators-in-Local-mode-for-testing-tp27321.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: KEYS file?

2016-07-11 Thread Sean Owen
Yeah the canonical place for a project's KEYS file for ASF projects is

http://www.apache.org/dist/{project}/KEYS

and so you can indeed find this key among:

http://www.apache.org/dist/spark/KEYS

I'll put a link to this info on the downloads page because it is important info.
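
For anyone following along, verification then looks roughly like this (the artifact name is just an example from the 1.6.2 binary downloads):

wget http://www.apache.org/dist/spark/KEYS
gpg --import KEYS
gpg --verify spark-1.6.2-bin-hadoop2.6.tgz.asc spark-1.6.2-bin-hadoop2.6.tgz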

On Mon, Jul 11, 2016 at 4:48 AM, Shuai Lin  wrote:
>> at least links to the keys used to sign releases on the
>> download page
>
>
> +1 for that.
>
> On Mon, Jul 11, 2016 at 3:35 AM, Phil Steitz  wrote:
>>
>> On 7/10/16 10:57 AM, Shuai Lin wrote:
>> > Not sure where you see " 0x7C6C105FFC8ED089". I
>>
>> That's the key ID for the key below.
>> > think the release is signed with the
>> > key https://people.apache.org/keys/committer/pwendell.asc .
>>
>> Thanks!  That key matches.  The project should publish a KEYS file
>> [1] or at least links to the keys used to sign releases on the
>> download page.  Could be there is one somewhere and I just can't
>> find it.
>>
>> Phil
>>
>> [1] http://www.apache.org/dev/release-signing.html#keys-policy
>> >
>> > I think this tutorial can be
>> > helpful: http://www.apache.org/info/verification.html
>> >
>> > On Mon, Jul 11, 2016 at 12:57 AM, Phil Steitz
>> > > wrote:
>> >
>> > I can't seem to find a link the the Spark KEYS file.  I am
>> > trying to
>> > validate the sigs on the 1.6.2 release artifacts and I need to
>> > import 0x7C6C105FFC8ED089.  Is there a KEYS file available for
>> > download somewhere?  Apologies if I am just missing an obvious
>> > link.
>> >
>> > Phil
>> >
>> >
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> > 
>> >
>> >
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



What is the maximum number of column being supported by apache spark dataframe

2016-07-11 Thread Zijing Guo
Hi all,

Spark version: 1.5.2 with YARN 2.7.1.2.3.0.0-2557. I'm running into a problem
while I'm exploring the data through spark-shell: I'm trying to create a really
fat dataframe with 3000 columns. Code as below:

val valueFunctionUDF = udf((valMap: Map[String, String], dataItemId: String) =>
  valMap.get(dataItemId) match {
  case Some(v) => v.toDouble
  case None => Double.NaN
})

s1 is the main dataframe and its schema is as below:

|-- combKey: string (nullable = true)
|-- valMaps: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)

After I run the code:

dataItemIdVals.foreach{w =>
 s1 = s1.withColumn(w, valueFunctionUDF($"valMaps", $"combKey"))}

my terminal just gets stuck after the above code, with the following info being
printed out:

16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
172.22.49.20:41494 in memory (size: 7.6 KB, free: 5.2 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:43026 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:44890 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:52020 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:33272 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:48481 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:44026 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:34539 in memory (size: 7.6 KB, free: 5.0 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:43734 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:42769 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:60603 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:59102 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:47578 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:43149 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:52488 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:52298 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 9
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
172.22.49.20:41494 in memory (size: 7.3 KB, free: 5.2 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:33272 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:59102 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:44026 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:42769 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:43149 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:43026 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:52298 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:42890 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:47578 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:60603 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:43734 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:48481 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:52020 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:52488 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:34539 in memory (size: 7.3 KB, free: 5.0 GB)
 16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 8
 16/07/11 12:20:54 INFO ContextCleaner: Cleaned shuffle 0
 16/07/11 12:20:54 INFO ContextCleaner: Cleaned 
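
Not an answer to the hang itself, but worth noting: each withColumn call above wraps the plan in another projection, so 3000 chained calls are likely to make plan analysis very expensive. A sketch of building all the columns in a single select instead, reusing the same s1, valueFunctionUDF and dataItemIdVals from the original code (and assuming the usual sqlContext.implicits._ import for the $ syntax):

import org.apache.spark.sql.functions.col

// one Column per new name, then a single select over the old and new columns
val newCols = dataItemIdVals.map(w => valueFunctionUDF($"valMaps", $"combKey").as(w))
val wide = s1.select((s1.columns.map(col) ++ newCols): _*)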

Marking files as read in Spark Streaming

2016-07-11 Thread soumick dasgupta
Hi,

I am looking for a solution in Spark Streaming where I can mark the files
that I have already read in HDFS. This is to make sure that I am not
reading the same file by mistake and also to ensure that I have read all
the records in a given file.

Thank You,

Soumick
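
One low-tech approach (not a Spark feature, just a sketch with hypothetical paths) is to move each file into a "processed" directory once the batch containing it has completed, using the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// after the batch containing this file has been processed successfully:
fs.rename(new Path("/data/incoming/file1.json"), new Path("/data/processed/file1.json"))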


Re: Cluster mode deployment from jar in S3

2016-07-11 Thread Steve Loughran
the fact you are using s3:// URLs means that you are using EMR and its S3 
binding lib. Which means you are probably going to have to talk to the AWS team 
there. Though I'm surprised to see a jets3t stack trace there, as the AWS s3: 
client uses the Amazon SDKs.

S3n and s3a don't currently support IAM Auth, which is what's generating the 
warning. The code in question is actually hadoop-aws.JAR, not the spark team's 
direct code, and is fixed in Hadoop 2.8 ( see: 
HADOOP-12723)


On 4 Jul 2016, at 11:30, Ashic Mahtab wrote:

Hi Lohith,
Thanks for the response.

The S3 bucket does have access restrictions, but the instances in which the 
Spark master and workers run have an IAM role policy that allows them access to 
it. As such, we don't really configure the cli with credentials...the IAM roles 
take care of that. Is there a way to make Spark work the same way? Or should I 
get temporary credentials somehow (like 
http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
 ), and use them to somehow submit the job? I guess I'll have to set it via 
environment variables; I can't put it in application code, as the issue is in 
downloading the jar from S3.

-Ashic.


From: lohith.sam...@mphasis.com
To: as...@live.com; 
user@spark.apache.org
Subject: RE: Cluster mode deployment from jar in S3
Date: Mon, 4 Jul 2016 09:50:50 +

Hi,
The aws CLI already has your access key id and secret access 
key when you initially configured it.
Is your s3 bucket without any access restrictions?





Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga





From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04, 2016 15.06
To: Apache Spark
Subject: RE: Cluster mode deployment from jar in S3



Sorry to do this...but... *bump*




From: as...@live.com
To: user@spark.apache.org
Subject: Cluster mode deployment from jar in S3
Date: Fri, 1 Jul 2016 17:45:12 +0100
Hello,
I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs 
using "--deploy-mode client", however using "--deploy-mode cluster" is proving 
to be a challenge. I've tried this:



spark-submit --class foo --master spark:://master-ip:7077 --deploy-mode cluster 
s3://bucket/dir/foo.jar



When I do this, I get:
16/07/01 16:23:16 ERROR ClientEndpoint: Exception from cluster was: 
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
(respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
(respectively).
at 
org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at 
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)





Now I'm not using any S3 or hadoop stuff within my code (it's just an 
sc.parallelize(1 to 100)). So, I imagine it's the driver trying to fetch the 
jar. I haven't set the AWS Access Key Id and Secret as mentioned, but the role 
the machines are in allows them to copy the jar. In other words, this works:



aws s3 cp s3://bucket/dir/foo.jar /tmp/foo.jar



I'm using Spark 1.6.2, and can't really think of what I can do so that I can 
submit the jar from s3 using cluster deploy mode. I've also tried simply 
downloading the jar onto a node, and spark-submitting that... that works in 
client mode, but I get a not found error when using cluster mode.



Any help will be appreciated.



Thanks,
Ashic.


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. 

Tez is ‘vendor’ independent.  ;-) 

Yeah… I know…  Anyone can support it.  Only Hortonworks has stacked the deck in 
their favor. 

Drill could be in the same boat, although there now more committers who are not 
working for MapR. I’m not sure who outside of HW is supporting Tez. 

But I digress. 

Here in the Spark user list, I have to ask how do you run hive on spark? Is the 
execution engine … the spark context always running? (Client mode I assume) 
Are the executors always running?   Can you run multiple queries from multiple 
users in parallel? 

These are some of the questions that should be asked and answered when 
considering how viable spark is going to be as the engine under Hive… 

Thx

-Mike

> On May 29, 2016, at 3:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to Hive user group to see anyone has managed to built a 
> vendor independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>> did not make enough efforts) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This is by no means indicate that Spark is much better than MR but shows 
>>> that some very good results can ve achieved using Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the write database for purpose or one is better off with 
>>> something like Phoenix on Hbase, well the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what wants is to use the fastest  method to get the results. How 
>>> fast we confine it to our SLA agreements in production and that helps us 
>>> from unnecessary further work as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, 

WARN FileOutputCommitter: Failed to delete the temporary output directory of task: attempt_201607111453_128606_m_000000_0 - s3n://

2016-07-11 Thread Andy Davidson
I am running into serious performance problems with my spark 1.6 streaming
app. As it runs it gets slower and slower.

My app is simple. 

* It receives fairly large and complex JSON files. (twitter data)
* Converts the RDD to DataFrame
* Splits the data frame in to maybe 20 different data sets
* Writes each data set as JSON to s3
* Writing to S3 is really slow. I use an executorService to get the writes
to run in parallel

I found a lot of error log messages like the following error in my spark
streaming executor log files

Any suggestions?

Thanks

Andy

16/07/11 14:53:49 WARN FileOutputCommitter: Failed to delete the temporary
output directory of task: attempt_201607111453_128606_m_00_0 -
s3n://com.xxx/json/yyy/2016-07-11/146824482/_temporary/_attempt_20160711
1453_128606_m_00_0




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
I think llap should be in the future a general component so llap + spark can 
make sense. I see tez and spark not as competitors but they have different 
purposes. Hive+Tez+llap is not the same as hive+spark. I think it goes beyond 
that for interactive queries .
Tez - you should use a distribution (eg Hortonworks) - generally I would use a 
distribution for anything related to performance , testing etc. because doing 
an own installation is more complex and more difficult to maintain. Performance 
and also features will be less good if you do not use a distribution. Which one 
is up to your choice.

> On 11 Jul 2016, at 17:09, Mich Talebzadeh  wrote:
> 
> The presentation will go deeper into the topic. Otherwise some thoughts  of 
> mine. Fell free to comment. criticise :) 
> 
> I am a member of Spark Hive and Tez user groups plus one or two others
> Spark is by far the biggest in terms of community interaction
> Tez, typically one thread in a month
> Personally started building Tez for Hive from Tez source and gave up as it 
> was not working. This was my own build as opposed to a distro
> if Hive says you should use Spark or Tez then using Spark is a perfectly 
> valid choice
> If Tez & LLAP offers you a Spark (DAG + in-memory caching) under the bonnet 
> why bother.
> Yes I have seen some test results (Hive on Spark vs Hive on Tez) etc. but 
> they are a bit dated (not being unkind) and cannot be taken as is today. One 
> their concern if I recall was excessive CPU and memory usage of Spark but 
> then with the same token LLAP will add additional need for resources
> Essentially I am more comfortable to use less of technology stack than more.  
> With Hive and Spark (in this context) we have two. With Hive, Tez and LLAP, 
> we have three stacks to look after that add to skill cost as well.
> Yep. It is still good to keep it simple
> 
> My thoughts on this are that if you have a viable open source product like 
> Spark which is becoming a sort of Vogue in Big Data space and moving very 
> fast, why look for another one. Hive does what it says on the Tin and good 
> reliable Data Warehouse.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 11 July 2016 at 15:22, Ashok Kumar  wrote:
>> Hi Mich,
>> 
>> Your recent presentation in London on this topic "Running Spark on Hive or 
>> Hive on Spark"
>> 
>> Have you made any more interesting findings that you like to bring up?
>> 
>> If Hive is offering both Spark and Tez in addition to MR, what stopping one 
>> not to use Spark? I still don't get why TEZ + LLAP is going to be a better 
>> choice from what you mentioned?
>> 
>> thanking you 
>> 
>> 
>> 
>> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
>> wrote:
>> 
>> 
>> Couple of points if I may and kindly bear with my remarks.
>> 
>> Whilst it will be very interesting to try TEZ with LLAP. As I read from LLAP
>> 
>> "Sub-second queries require fast query execution and low setup cost. The 
>> challenge for Hive is to achieve this without giving up on the scale and 
>> flexibility that users depend on. This requires a new approach using a 
>> hybrid engine that leverages Tez and something new called  LLAP (Live Long 
>> and Process, #llap online).
>> 
>> LLAP is an optional daemon process running on multiple nodes, that provides 
>> the following:
>> Caching and data reuse across queries with compressed columnar data 
>> in-memory (off-heap)
>> Multi-threaded execution including reads with predicate pushdown and hash 
>> joins
>> High throughput IO using Async IO Elevator with dedicated thread and core 
>> per disk
>> Granular column level security across applications
>> "
>> OK so we have added an in-memory capability to TEZ by way of LLAP, In other 
>> words what Spark does already and BTW it does not require a daemon running 
>> on any host. Don't take me wrong. It is interesting but this sounds to me 
>> (without testing myself) adding caching capability to TEZ to bring it on par 
>> with SPARK.
>> 
>> Remember:
>> 
>> Spark -> DAG + in-memory caching
>> TEZ = MR on DAG
>> TEZ + LLAP => DAG + in-memory caching
>> 
>> OK it is another way getting the same result. However, my concerns:
>> 
>> Spark has a wide user base. I judge this from Spark user group traffic
>> TEZ user group has no traffic I am afraid
>> LLAP I don't know
>> Sounds like Hortonworks promote TEZ and Cloudera does not want to know 
>> anything about Hive. and they promote Impala but that 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
I don’t think that it would be a good comparison. 

If memory serves, Tez w LLAP is going to be running a separate engine that is 
constantly running, no? 

Spark?  That runs under hive… 

Unless you’re suggesting that the spark context is constantly running as part 
of the hiveserver2? 

> On May 23, 2016, at 6:51 PM, Jörn Franke  wrote:
> 
> 
> Hi Mich,
> 
> I think these comparisons are useful. One interesting aspect could be 
> hardware scalability in this context. Additionally different type of 
> computations. Furthermore, one could compare Spark and Tez+llap as execution 
> engines. I have the gut feeling that  each one can be justified by different 
> use cases.
> Nevertheless, there should be always a disclaimer for such comparisons, 
> because Spark and Hive are not good for a lot of concurrent lookups of single 
> rows. They are not good for frequently write small amounts of data (eg sensor 
> data). Here hbase could be more interesting. Other use cases can justify 
> graph databases, such as Titan, or text analytics/ data matching using Solr 
> on Hadoop.
> Finally, even if you have a lot of data you need to think if you always have 
> to process everything. For instance, I have found valid use cases in practice 
> where we decided to evaluate 10 machine learning models in parallel on only a 
> sample of data and only evaluate the "winning" model of the total of data.
> 
> As always it depends :) 
> 
> Best regards
> 
> P.s.: at least Hortonworks has in their distribution spark 1.5 with hive 1.2 
> and spark 1.6 with hive 1.2. Maybe they have somewhere described how to 
> manage bringing both together. You may check also Apache Bigtop (vendor 
> neutral distribution) on how they managed to bring both together.
> 
> On 23 May 2016, at 01:42, Mich Talebzadeh  > wrote:
> 
>> Hi,
>>  
>> I have done a number of extensive tests using Spark-shell with Hive DB and 
>> ORC tables.
>>  
>> Now one issue that we typically face is and I quote:
>>  
>> Spark is fast as it uses Memory and DAG. Great but when we save data it is 
>> not fast enough
>> 
>> OK but there is a solution now. If you use Spark with Hive and you are on a 
>> descent version of Hive >= 0.14, then you can also deploy Spark as execution 
>> engine for Hive. That will make your application run pretty fast as you no 
>> longer rely on the old Map-Reduce for Hive engine. In a nutshell what you 
>> are gaining speed in both querying and storage.
>>  
>> I have made some comparisons on this set-up and I am sure some of you will 
>> find it useful.
>>  
>> The version of Spark I use for Spark queries (Spark as query tool) is 1.6.
>> The version of Hive I use in Hive 2
>> The version of Spark I use as Hive execution engine is 1.3.1 It works and 
>> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out 
>> the Hadoop libraries mismatch).
>>  
>> An example I am using Hive on Spark engine to find the min and max of IDs 
>> for a table with 1 billion rows:
>>  
>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), 
>> stddev(id) from oraclehadoop.dummy;
>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>  
>>  
>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>  
>> INFO  : Completed compiling 
>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); 
>> Time taken: 1.911 seconds
>> INFO  : Executing 
>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): 
>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> INFO  : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>> INFO  : Total jobs = 1
>> INFO  : Launching Job 1 out of 1
>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>  
>> Query Hive on Spark job[0] stages:
>> 0
>> 1
>> Status: Running (Hive on Spark job[0])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId: 
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>> [StageCost]
>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>> INFO  :
>> Query Hive on Spark job[0] stages:
>> INFO  : 0
>> INFO  : 1
>> INFO  :
>> Status: Running (Hive on Spark job[0])
>> INFO  : Job Progress Format
>> CurrentTime StageId_StageAttemptId: 
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>> [StageCost]
>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished 

spark UI what does storage memory x/y mean

2016-07-11 Thread Andy Davidson
My stream app is running into problems. It seems to slow down over time. How
can I interpret the storage memory column? I wonder if I have a GC problem.
Any idea how I can get GC stats?

Thanks

Andy

Executors (3)
* Memory: 9.4 GB Used (1533.4 MB Total)
* Disk: 0.0 B Used
Executor ID | Address                                           | RDD Blocks | Storage Memory    | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input   | Shuffle Read | Shuffle Write
0           | ip-172-31-23-202.us-west-1.compute.internal:52456 | 2860       | 4.7 GB / 511.1 MB | 0.0 B     | 0            | 401          | 349579         | 349980      | 5.37 h    | 72.9 GB | 84.0 B       | 5.9 MB
1           | ip-172-31-23-200.us-west-1.compute.internal:51609 | 2854       | 4.6 GB / 511.1 MB | 0.0 B     | 0            | 411          | 349365         | 349776      | 5.42 h    | 72.6 GB | 142.0 B      | 5.9 MB
driver      | 172.31.23.203:48018                               | 0          | 0.0 B / 511.1 MB  | 0.0 B     | -            | -            | -              | -           | 0 ms      | 0.0 B   | 0.0 B        | 0.0 B
(stdout / stderr / Thread Dump links omitted)
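
In the Executors tab, "Storage Memory" is the memory currently used for cached blocks out of the total memory that executor has available for storage. For GC stats, one option is to pass the standard JVM GC-logging flags through Spark's extraJavaOptions (a sketch; the flags are plain HotSpot options, and the executor output lands in each executor's stderr log):

spark-submit ... \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"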





Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Fridtjof Sander
Spark's implementation does perform PAVA on each partition only to then 
collect each result to the driver and to perform PAVA again on the 
collected results. The hope of that is, that enough data is pooled, so 
that the the last step does not exceed the drivers memory limits. This 
assumption does of course not generally hold. Just consider what 
happens, if the data is already correctly sorted. In that case nothing 
is pooled and model size roughly equals data size. Spark's IR model 
saves boundaries and predictions as double arrays instead, so the 
(unpooled) data has to fit into memory.


https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala

On 11.07.2016 at 17:06, Yanbo Liang wrote:
IsotonicRegression can handle feature column of vector type. It will 
extract the a certain index (controlled by param "featureIndex") of 
this feature vector and feed it into model training. It will perform 
Pool adjacent violators algorithms on each partition, so it's 
distributed and the data is not necessary to fit into memory of a 
single machine.

The following code snippets can work well on my machine:

val labels = Seq(1, 2, 3, 1, 6, 17, 16, 17, 18)
val dataset = spark.createDataFrame(
  labels.zipWithIndex.map { case (label, i) =>
    (label, Vectors.dense(Array(i.toDouble, i.toDouble + 1.0)), 1.0)
  }
).toDF("label", "features", "weight")

val ir = new IsotonicRegression().setIsotonic(true)
val model = ir.fit(dataset)

val predictions = model
  .transform(dataset)
  .select("prediction").rdd.map { case Row(pred) => pred }.collect()

assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18))


Thanks
Yanbo

2016-07-11 6:14 GMT-07:00 Fridtjof Sander:


Hi Swaroop,

from my understanding, Isotonic Regression is currently limited to
data with 1 feature plus weight and label. Also the entire data is
required to fit into memory of a single machine.
I did some work on the latter issue but discontinued the project,
because I felt no one really needed it. I'd be happy to resume my
work on Spark's IR implementation, but I fear there won't be a
quick for your issue.

Fridtjof


Am 08.07.2016 um 22:38 schrieb dsp:

Hi I am trying to perform Isotonic Regression on a data set
with 9 features
and a label.
When I run the algorithm similar to the way mentioned on MLlib
page, I get
the error saying

/*error:* overloaded method value run with alternatives:
(input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
java.lang.Double,

java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel

   (input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
  cannot be applied to
(org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
scala.Double, scala.Double, scala.Double, scala.Double,
scala.Double,
scala.Double, scala.Double, scala.Double, scala.Double,
scala.Double,
scala.Double)])
  val model = new
IsotonicRegression().setIsotonic(true).run(training)/

For the may given in the sample code, it looks like it can be
done only for
dataset with a single feature because run() method can accept
only three
parameters leaving which already has a label and a default
value leaving
place for only one variable.
So, How can this be done for multiple variables ?

Regards,
Swaroop



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org







Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
IsotonicRegression can handle feature column of vector type. It will
extract a certain index (controlled by param "featureIndex") of this
feature vector and feed it into model training. It will perform Pool
adjacent violators algorithms on each partition, so it's distributed and
the data is not necessary to fit into memory of a single machine.
The following code snippets can work well on my machine:

val labels = Seq(1, 2, 3, 1, 6, 17, 16, 17, 18)
val dataset = spark.createDataFrame(
  labels.zipWithIndex.map { case (label, i) =>
(label, Vectors.dense(Array(i.toDouble, i.toDouble + 1.0)), 1.0)
  }
).toDF("label", "features", "weight")

val ir = new IsotonicRegression().setIsotonic(true)

val model = ir.fit(dataset)

val predictions = model
  .transform(dataset)
  .select("prediction").rdd.map { case Row(pred) =>
  pred
}.collect()

assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18))


Thanks
Yanbo

2016-07-11 6:14 GMT-07:00 Fridtjof Sander :

> Hi Swaroop,
>
> from my understanding, Isotonic Regression is currently limited to data
> with 1 feature plus weight and label. Also the entire data is required to
> fit into memory of a single machine.
> I did some work on the latter issue but discontinued the project, because
> I felt no one really needed it. I'd be happy to resume my work on Spark's
> IR implementation, but I fear there won't be a quick for your issue.
>
> Fridtjof
>
>
> Am 08.07.2016 um 22:38 schrieb dsp:
>
>> Hi I am trying to perform Isotonic Regression on a data set with 9
>> features
>> and a label.
>> When I run the algorithm similar to the way mentioned on MLlib page, I get
>> the error saying
>>
>> /*error:* overloaded method value run with alternatives:
>> (input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
>> java.lang.Double,
>>
>> java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
>> 
>>(input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
>> scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
>>   cannot be applied to (org.apache.spark.rdd.RDD[(scala.Double,
>> scala.Double,
>> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
>> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
>> scala.Double)])
>>   val model = new
>> IsotonicRegression().setIsotonic(true).run(training)/
>>
>> For the may given in the sample code, it looks like it can be done only
>> for
>> dataset with a single feature because run() method can accept only three
>> parameters leaving which already has a label and a default value leaving
>> place for only one variable.
>> So, How can this be done for multiple variables ?
>>
>> Regards,
>> Swaroop
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
The presentation will go deeper into the topic. Otherwise some thoughts of
mine. Feel free to comment or criticise :)


   1. I am a member of Spark Hive and Tez user groups plus one or two others
   2. Spark is by far the biggest in terms of community interaction
   3. Tez, typically one thread in a month
   4. Personally started building Tez for Hive from Tez source and gave up
   as it was not working. This was my own build as opposed to a distro
   5. if Hive says you should use Spark or Tez then using Spark is a
   perfectly valid choice
   6. If Tez & LLAP offers you a Spark (DAG + in-memory caching) under the
   bonnet why bother.
   7. Yes I have seen some test results (Hive on Spark vs Hive on Tez) etc.
   but they are a bit dated (not being unkind) and cannot be taken as is
   today. One of their concerns, if I recall, was excessive CPU and memory usage
   of Spark, but by the same token LLAP will add additional need for
   resources.
   8. Essentially I am more comfortable using less of the technology stack
   than more.  With Hive and Spark (in this context) we have two. With Hive,
   Tez and LLAP, we have three stacks to look after that add to skill cost as
   well.
   9. Yep. It is still good to keep it simple


My thoughts on this are that if you have a viable open source product like
Spark which is becoming a sort of Vogue in Big Data space and moving very
fast, why look for another one. Hive does what it says on the Tin and good
reliable Data Warehouse.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 15:22, Ashok Kumar  wrote:

> Hi Mich,
>
> Your recent presentation in London on this topic "Running Spark on Hive or
> Hive on Spark"
>
> Have you made any more interesting findings that you like to bring up?
>
> If Hive is offering both Spark and Tez in addition to MR, what stopping
> one not to use Spark? I still don't get why TEZ + LLAP is going to be a
> better choice from what you mentioned?
>
> thanking you
>
>
>
> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh 
> wrote:
>
>
> Couple of points if I may and kindly bear with my remarks.
>
> Whilst it will be very interesting to try TEZ with LLAP. As I read from
> LLAP
>
> "Sub-second queries require fast query execution and low setup cost. The
> challenge for Hive is to achieve this without giving up on the scale and
> flexibility that users depend on. This requires a new approach using a
> hybrid engine that leverages Tez and something new called  LLAP (Live Long
> and Process, #llap online).
>
> LLAP is an optional daemon process running on multiple nodes, that
> provides the following:
>
>- Caching and data reuse across queries with compressed columnar data
>in-memory (off-heap)
>- Multi-threaded execution including reads with predicate pushdown and
>hash joins
>- High throughput IO using Async IO Elevator with dedicated thread and
>core per disk
>- Granular column level security across applications
>- "
>
> OK so we have added an in-memory capability to TEZ by way of LLAP, In
> other words what Spark does already and BTW it does not require a daemon
> running on any host. Don't take me wrong. It is interesting but this sounds
> to me (without testing myself) adding caching capability to TEZ to bring it
> on par with SPARK.
>
> Remember:
>
> Spark -> DAG + in-memory caching
> TEZ = MR on DAG
> TEZ + LLAP => DAG + in-memory caching
>
> OK it is another way getting the same result. However, my concerns:
>
>
>- Spark has a wide user base. I judge this from Spark user group
>traffic
>- TEZ user group has no traffic I am afraid
>- LLAP I don't know
>
> Sounds like Hortonworks promote TEZ and Cloudera does not want to know
> anything about Hive. and they promote Impala but that sounds like a sinking
> ship these days.
>
> Having said that I will try TEZ + LLAP :) No pun intended
>
> Regards
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 31 May 2016 at 08:19, Jörn Franke  wrote:
>
> Thanks very interesting explanation. Looking forward to test it.
>
> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan 
> wrote:
> >
> >
> >> That being said all systems are evolving. 

Spark hangs at "Removed broadcast_*"

2016-07-11 Thread velvetbaldmime
Spark 2.0.0-preview

We've got an app that uses a fairly big broadcast variable. We run this on a
big EC2 instance, so deployment is in client-mode. Broadcasted variable is a
massive Map[String, Array[String]].

At the end of saveAsTextFile, the output in the folder seems to be complete
and correct (apart from .crc files still being there) BUT the spark-submit
process is stuck on, seemingly, removing the broadcast variable. The stuck
logs look like this: http://pastebin.com/wpTqvArY

My last run lasted for 12 hours after doing saveAsTextFile - just
sitting there. I did a jstack on driver process, most threads are parked:
http://pastebin.com/E29JKVT7

Full story: We used this code with Spark 1.5.0 and it worked, but then the
data changed and something stopped fitting into Kryo's serialisation buffer.
Increasing it didn't help, so I had to disable the KryoSerialiser. Tested it
again - it hanged. Switched to 2.0.0-preview - seems like the same issue.

I'm not quite sure what's even going on given that there's almost no CPU
activity and no output in the logs, yet the output is not finalised like it
used to before.

Would appreciate any help, thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich,
Your recent presentation in London on this topic "Running Spark on Hive or Hive 
on Spark"
Have you made any more interesting findings that you like to bring up?
If Hive is offering both Spark and Tez in addition to MR, what is stopping one 
from using Spark? I still don't get why TEZ + LLAP is going to be a better choice 
from what you mentioned?
thanking you 
 

On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
wrote:
 

 Couple of points if I may and kindly bear with my remarks. 
Whilst it will be very interesting to try TEZ with LLAP. As I read from LLAP
"Sub-second queries require fast query execution and low setup cost. The 
challenge for Hive is to achieve this without giving up on the scale and 
flexibility that users depend on. This requires a new approach using a hybrid 
engine that leverages Tez and something new called  LLAP (Live Long and 
Process, #llap online).
LLAP is an optional daemon process running on multiple nodes, that provides the 
following:   
   - Caching and data reuse across queries with compressed columnar data 
in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and hash 
joins
   - High throughput IO using Async IO Elevator with dedicated thread and core 
per disk
   - Granular column level security across applications
   - "
OK so we have added an in-memory capability to TEZ by way of LLAP, In other 
words what Spark does already and BTW it does not require a daemon running on 
any host. Don't take me wrong. It is interesting but this sounds to me (without 
testing myself) adding caching capability to TEZ to bring it on par with SPARK. 
Remember:
Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching
OK it is another way getting the same result. However, my concerns:
   
   - Spark has a wide user base. I judge this from Spark user group traffic
   - TEZ user group has no traffic I am afraid
   - LLAP I don't know
Sounds like Hortonworks promote TEZ and Cloudera does not want to know anything 
about Hive. and they promote Impala but that sounds like a sinking ship these 
days.
Having said that I will try TEZ + LLAP :) No pun intended
Regards
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 31 May 2016 at 08:19, Jörn Franke  wrote:

Thanks very interesting explanation. Looking forward to test it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan  wrote:
>
>
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
>
> There is a big difference between where LLAP & SparkSQL, which has to do
> with access pattern needs.
>
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session which allows for further operation in that
> session to be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve it for ORC, but the
> goals of both are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of send my ramblings to the user lists like this.
> Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single use clusters
> and take the use-case detailed here
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
>
> There are two distinct problems there - one is that a single day sees upto
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
>
> The problem with having independent 100k user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
>
> The same problem applies when you apply RDD caching with something like
> un-replicated like Tachyon/Alluxio, since the same RDD will be exceeding
> popular that the machines which hold those blocks run extra hot.
>
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of cluster, while leaving the other
> machines idle.
>
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Fridtjof Sander

Hi Swaroop,

From my understanding, Isotonic Regression is currently limited to data 
with 1 feature plus weight and label. Also, the entire data set is required 
to fit into the memory of a single machine.
I did some work on the latter issue but discontinued the project, 
because I felt no one really needed it. I'd be happy to resume my work 
on Spark's IR implementation, but I fear there won't be a quick fix for your 
issue.


Fridtjof
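
To illustrate: run() only accepts an RDD of (label, feature, weight) triples, so with
the current API you have to reduce the 9 features to a single one (or train one model
per feature). A rough sketch, assuming a hypothetical CSV layout of label followed by
nine features:

import org.apache.spark.mllib.regression.IsotonicRegression

// hypothetical file: each line is "label,f1,f2,...,f9"
val rows = sc.textFile("data/points.csv").map(_.split(',').map(_.toDouble))

// keep the label, ONE chosen feature (here f1) and a constant weight of 1.0
val training = rows.map(r => (r(0), r(1), 1.0))   // RDD[(Double, Double, Double)]

val model = new IsotonicRegression().setIsotonic(true).run(training)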

On 08.07.2016 at 22:38, dsp wrote:

Hi, I am trying to perform Isotonic Regression on a data set with 9 features
and a label.
When I run the algorithm similar to the way mentioned on MLlib page, I get
the error saying

/*error:* overloaded method value run with alternatives:
(input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
java.lang.Double,
java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel

   (input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
  cannot be applied to (org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
scala.Double)])
  val model = new
IsotonicRegression().setIsotonic(true).run(training)/

For the way given in the sample code, it looks like it can be done only for a
dataset with a single feature, because the run() method accepts only a tuple of
three values, which already holds the label and the weight, leaving room for
only one feature.
So, how can this be done for multiple variables?

Regards,
Swaroop



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark job state is EXITED but does not return

2016-07-11 Thread Balachandar R.A.
Hello,

I have one simple Apache Spark based use case that processes two datasets.
Each dataset takes about 5-7 min to process. I am doing this processing
inside the sc.parallelize(datasets){ } block. While the first dataset is
processed successfully, the processing of the second dataset is not started by Spark.
The application state is RUNNING, but in the executor summary I notice that the
executor state is EXITED. Can someone tell me where things are going wrong?

Regards
Bala


Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Chanh Le
Hi Mich,

I have a stored procedure in Oracle written like this.
SP to get info:
PKG_ETL.GET_OBJECTS_INFO(
p_LAST_UPDATED VARCHAR2,
p_OBJECT_TYPE VARCHAR2,
p_TABLE OUT SYS_REFCURSOR);
How can I call it from Spark, given that the output is a cursor (p_TABLE OUT SYS_REFCURSOR)?


Thanks.
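
For reference: as far as I know, Spark's JDBC data source only accepts a table name or
a sub-query as "dbtable", not a REF CURSOR, so one workaround is to open the cursor
with plain JDBC on the driver and build a DataFrame from the rows yourself. A rough
sketch (connection details and bind values are placeholders, error handling omitted):

import java.sql.{DriverManager, ResultSet}
import oracle.jdbc.OracleTypes

val conn = DriverManager.getConnection("jdbc:oracle:thin:@host:1521:mydb", "user", "pass")
val cs = conn.prepareCall("{ call PKG_ETL.GET_OBJECTS_INFO(?, ?, ?) }")
cs.setString(1, "2016-07-11")           // p_LAST_UPDATED (placeholder value)
cs.setString(2, "TABLE")                // p_OBJECT_TYPE (placeholder value)
cs.registerOutParameter(3, OracleTypes.CURSOR)
cs.execute()
val rs = cs.getObject(3).asInstanceOf[ResultSet]
// iterate over rs, collect the rows into a local Seq, and then build a DataFrame
// with sqlContext.createDataFrame(...) if the result is needed in Spark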


> On Jul 11, 2016, at 4:18 PM, Mark Vervuurt  wrote:
> 
> Thanks Mich,
> 
> we have got it working using the example here under ;)
> 
> Mark
> 
>> On 11 Jul 2016, at 09:45, Mich Talebzadeh > > wrote:
>> 
>> Hi Mark,
>> 
>> Hm. It should work. This is Spark 1.6.1 on Oracle 12c
>>  
>>  
>> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> HiveContext: org.apache.spark.sql.hive.HiveContext = 
>> org.apache.spark.sql.hive.HiveContext@70f446c
>>  
>> scala> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
>>  
>> scala> var _username : String = "sh"
>> _username: String = sh
>>  
>> scala> var _password : String = ""
>> _password: String = sh
>>  
>> scala> val c = HiveContext.load("jdbc",
>>  | Map("url" -> _ORACLEserver,
>>  | "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC 
>> FROM sh.channels)",
>>  | "user" -> _username,
>>  | "password" -> _password))
>> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
>> details
>> c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: string, CHANNEL_DESC: 
>> string]
>>  
>> scala> c.registerTempTable("t_c")
>>  
>> scala> c.count
>> res2: Long = 5
>>  
>> scala> HiveContext.sql("select * from t_c").collect.foreach(println)
>> [3,Direct Sales]
>> [9,Tele Sales]
>> [5,Catalog]
>> [4,Internet]
>> [2,Partners]
>>  
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 11 July 2016 at 08:25, Mark Vervuurt > > wrote:
>> Hi Mich,
>> 
>> sorry for bothering did you manage to solve your problem? We have a similar 
>> problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an 
>> Oracle Database.
>> 
>> Thanks,
>> Mark
>> 
>>> On 12 Feb 2016, at 11:45, Mich Talebzadeh >> > wrote:
>>> 
>>> Hi,
>>>  
>>> I use the following to connect to Oracle DB from Spark shell 1.5.2
>>>  
>>> spark-shell --master spark://50.140.197.217:7077 <> --driver-class-path 
>>> /home/hduser/jars/ojdbc6.jar
>>>  
>>> in Scala I do
>>>  
>>> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>> sqlContext: org.apache.spark.sql.SQLContext = 
>>> org.apache.spark.sql.SQLContext@f9d4387
>>>  
>>> scala> val channels = sqlContext.read.format("jdbc").options(
>>>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>>>  |  "dbtable" -> "(select * from sh.channels where channel_id = 
>>> 14)",
>>>  |  "user" -> "sh",
>>>  |   "password" -> "xxx")).load
>>> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
>>> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
>>> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>>>  
>>> scala> channels.count()
>>>  
>>> But the latter command keeps hanging?
>>>  
>>> Any ideas appreciated
>>>  
>>> Thanks,
>>>  
>>> Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> NOTE: The information in this email is proprietary and confidential. This 
>>> message is for the designated recipient only, if you are not the intended 
>>> recipient, you should destroy it immediately. Any information in this 
>>> message shall not be understood as given or endorsed by Peridale Technology 
>>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>>> the responsibility of the recipient to ensure that this email is virus 
>>> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>>> employees accept any responsibility.
>> 
>> Met vriendelijke groet | Best regards,
>> 

Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Thanks Mich,

we have got it working using the example below ;)

Mark

> On 11 Jul 2016, at 09:45, Mich Talebzadeh  wrote:
> 
> Hi Mark,
> 
> Hm. It should work. This is Spark 1.6.1 on Oracle 12c
>  
>  
> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> HiveContext: org.apache.spark.sql.hive.HiveContext = 
> org.apache.spark.sql.hive.HiveContext@70f446c
>  
> scala> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
>  
> scala> var _username : String = "sh"
> _username: String = sh
>  
> scala> var _password : String = ""
> _password: String = sh
>  
> scala> val c = HiveContext.load("jdbc",
>  | Map("url" -> _ORACLEserver,
>  | "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC 
> FROM sh.channels)",
>  | "user" -> _username,
>  | "password" -> _password))
> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
> details
> c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: string, CHANNEL_DESC: string]
>  
> scala> c.registerTempTable("t_c")
>  
> scala> c.count
> res2: Long = 5
>  
> scala> HiveContext.sql("select * from t_c").collect.foreach(println)
> [3,Direct Sales]
> [9,Tele Sales]
> [5,Catalog]
> [4,Internet]
> [2,Partners]
>  
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 11 July 2016 at 08:25, Mark Vervuurt  > wrote:
> Hi Mich,
> 
> sorry for bothering did you manage to solve your problem? We have a similar 
> problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an 
> Oracle Database.
> 
> Thanks,
> Mark
> 
>> On 12 Feb 2016, at 11:45, Mich Talebzadeh > > wrote:
>> 
>> Hi,
>>  
>> I use the following to connect to Oracle DB from Spark shell 1.5.2
>>  
>> spark-shell --master spark://50.140.197.217:7077 <> --driver-class-path 
>> /home/hduser/jars/ojdbc6.jar
>>  
>> in Scala I do
>>  
>> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> sqlContext: org.apache.spark.sql.SQLContext = 
>> org.apache.spark.sql.SQLContext@f9d4387
>>  
>> scala> val channels = sqlContext.read.format("jdbc").options(
>>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>>  |  "dbtable" -> "(select * from sh.channels where channel_id = 14)",
>>  |  "user" -> "sh",
>>  |   "password" -> "xxx")).load
>> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
>> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
>> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>>  
>> scala> channels.count()
>>  
>> But the latter command keeps hanging?
>>  
>> Any ideas appreciated
>>  
>> Thanks,
>>  
>> Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Peridale Technology 
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>> the responsibility of the recipient to ensure that this email is virus free, 
>> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>> employees accept any responsibility.
> 
> Met vriendelijke groet | Best regards,
> ___
> 
> Ir. Mark Vervuurt
> Senior Big Data Scientist | Insights & Data
> 
> Capgemini Nederland | Utrecht
> Tel.: +31 30 6890978  – Mob.: +31653670390 
> 
> www.capgemini.com 
> 
>  People matter, results count.
> __
> 
> 
> 
> 



question about UDAF

2016-07-11 Thread luohui20001
Hello guys: I have a DF and a UDAF. This DF has 2 columns, lp_location_id and 
id, both of Int type. I want to group by lp_location_id and aggregate all values of 
id into 1 string. So I used a UDAF to do this transformation: multiple Int values to 
1 String. However, my UDAF returns empty values, as the attached file shows. 
Here is the code of my main class:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hiveTable = hc.sql("select lp_location_id,id from house_id_pv_location_top50")

val jsonArray = new JsonArray
val result = hiveTable.groupBy("lp_location_id").agg(jsonArray(col("id")).as("jsonArray")).collect.foreach(println)

Here is the code of my UDAF:
class JsonArray extends UserDefinedAggregateFunction {
  def inputSchema: org.apache.spark.sql.types.StructType =
StructType(StructField("id", IntegerType) :: Nil)

  def bufferSchema: StructType = StructType(
StructField("id", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = ""
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = buffer.getAs[Int](0)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val s1 = buffer1.getAs[Int](0).toString()
val s2 = buffer2.getAs[Int](0).toString()
buffer1(0) = s1.concat(s2)
  }
  def evaluate(buffer: Row): Any = {
buffer(0)
  }
}

I don't quite understand why I get an empty result from my UDAF. I guess there may 
be 2 reasons:
1. wrong initialization with "" in the initialize method
2. the buffer didn't get written to successfully.
Can anyone share an idea about this? Thank you.
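
For what it's worth, a minimal sketch of how the buffer handling could look so that the
ids actually accumulate: the buffer holds a String, so update has to append the incoming
Int to it rather than overwrite it with an Int. The class name and the comma separator
below are only illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ConcatIds extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("id", IntegerType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("ids", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""

  // append the incoming Int id to the running String buffer
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getString(0) + "," + input.getInt(0)

  // both partial buffers hold Strings, so read them as Strings and concatenate
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getString(0) + buffer2.getString(0)

  def evaluate(buffer: Row): Any = buffer.getString(0).stripPrefix(",")
}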





 

Thanks & best regards!
San.Luo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Zeppelin Spark with Dynamic Allocation

2016-07-11 Thread Chanh Le
Hi Tamas,
I am using Spark 1.6.1.





> On Jul 11, 2016, at 3:24 PM, Tamas Szuromi  wrote:
> 
> Hello,
> 
> What spark version do you use? I have the same issue with Spark 1.6.1 and 
> there is a ticket somewhere.
> 
> cheers,
> 
> 
> 
> 
> Tamas Szuromi
> Data Analyst
> Skype: tromika
> E-mail: tamas.szur...@odigeo.com 
> 
> ODIGEO Hungary Kft.
> 1066 Budapest
> Weiner Leó u. 16.
> www.liligo.com  
> check out our newest video  
> 
> 
> 
> On 11 July 2016 at 10:09, Chanh Le  > wrote:
> Hi everybody,
> I am testing Zeppelin with dynamic allocation but it seems it's not working.
> 
> 
> 
> 
> 
> 
> 
> From the logs I received, I saw that the Spark Context was created successfully and the task 
> was running, but after that it was terminated.
> Any ideas on that?
> Thanks.
> 
> 
> 
>  INFO [2016-07-11 15:03:40,096] ({Thread-0} 
> RemoteInterpreterServer.java[run]:81) - Starting remote interpreter server on 
> port 24994
>  INFO [2016-07-11 15:03:40,471] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkInterpreter
>  INFO [2016-07-11 15:03:40,521] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.PySparkInterpreter
>  INFO [2016-07-11 15:03:40,526] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkRInterpreter
>  INFO [2016-07-11 15:03:40,528] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
>  INFO [2016-07-11 15:03:40,531] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.DepInterpreter
>  INFO [2016-07-11 15:03:40,563] ({pool-2-thread-5} 
> SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468224220562 
> started by scheduler org.apache.zeppelin.spark.SparkInterpreter998491254
>  WARN [2016-07-11 15:03:41,559] ({pool-2-thread-5} 
> NativeCodeLoader.java[]:62) - Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
>  INFO [2016-07-11 15:03:41,703] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing view acls to: root
>  INFO [2016-07-11 15:03:41,704] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing modify acls to: root
>  INFO [2016-07-11 15:03:41,708] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(root); users with modify permissions: Set(root)
>  INFO [2016-07-11 15:03:41,977] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Starting HTTP Server
>  INFO [2016-07-11 15:03:42,029] ({pool-2-thread-5} Server.java[doStart]:272) 
> - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 15:03:42,047] ({pool-2-thread-5} 
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0 
> :53313
>  INFO [2016-07-11 15:03:42,048] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Successfully started service 'HTTP class server' on port 53313.
>  INFO [2016-07-11 15:03:43,978] ({pool-2-thread-5} 
> SparkInterpreter.java[createSparkContext]:233) - -- Create new 
> SparkContext mesos://zk://master1:2181,master2:2181,master3:2181/mesos <> 
> ---
>  INFO [2016-07-11 15:03:44,003] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Running Spark version 1.6.1
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing view acls to: root
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing modify acls to: root
>  INFO [2016-07-11 15:03:44,037] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(root); users with modify permissions: Set(root)
>  INFO [2016-07-11 15:03:44,231] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Successfully started service 'sparkDriver' on port 33913.
>  INFO [2016-07-11 15:03:44,552] 
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4} 
> Slf4jLogger.scala[applyOrElse]:80) - Slf4jLogger started
>  INFO [2016-07-11 15:03:44,597] 
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4} 
> Slf4jLogger.scala[apply$mcV$sp]:74) - Starting remoting
>  INFO [2016-07-11 15:03:44,754] 
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4} 
> Slf4jLogger.scala[apply$mcV$sp]:74) - Remoting started; listening on 
> addresses :[akka.tcp://sparkDriverActorSystem@10.197.0.3:55213 <>]
>  INFO [2016-07-11 15:03:44,760] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Successfully started service 'sparkDriverActorSystem' on port 55213.
>  INFO 
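
For reference, dynamic allocation also needs the external shuffle service in addition
to the allocation flags. On Mesos that roughly means setting, in the Spark interpreter
properties or spark-defaults (the numbers below are just placeholders):

spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         10
spark.dynamicAllocation.executorIdleTimeout  60s

and, if I remember correctly, starting the shuffle service on every Mesos agent with
$SPARK_HOME/sbin/start-mesos-shuffle-service.sh.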

Re: Zeppelin Spark with Dynamic Allocation

2016-07-11 Thread Tamas Szuromi
Hello,

What spark version do you use? I have the same issue with Spark 1.6.1 and
there is a ticket somewhere.

cheers,




Tamas Szuromi

Data Analyst

*Skype: *tromika
*E-mail: *tamas.szur...@odigeo.com 

[image: ODIGEO Hungary]

ODIGEO Hungary Kft.
1066 Budapest
Weiner Leó u. 16.

www.liligo.com  
check out our newest video  



On 11 July 2016 at 10:09, Chanh Le  wrote:

> Hi everybody,
> I am testing Zeppelin with dynamic allocation but it seems it's not working.
>
>
>
>
>
>
> From the logs I received, I saw that the Spark Context was created successfully and the task
> was running, but after that it was terminated.
> Any ideas on that?
> Thanks.
>
>
>
>  INFO [2016-07-11 15:03:40,096] ({Thread-0}
> RemoteInterpreterServer.java[run]:81) - Starting remote interpreter server
> on port 24994
>  INFO [2016-07-11 15:03:40,471] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.SparkInterpreter
>  INFO [2016-07-11 15:03:40,521] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.PySparkInterpreter
>  INFO [2016-07-11 15:03:40,526] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.SparkRInterpreter
>  INFO [2016-07-11 15:03:40,528] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
>  INFO [2016-07-11 15:03:40,531] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.DepInterpreter
>  INFO [2016-07-11 15:03:40,563] ({pool-2-thread-5}
> SchedulerFactory.java[jobStarted]:131) - Job
> remoteInterpretJob_1468224220562 started by scheduler
> org.apache.zeppelin.spark.SparkInterpreter998491254
>  WARN [2016-07-11 15:03:41,559] ({pool-2-thread-5}
> NativeCodeLoader.java[]:62) - Unable to load native-hadoop library
> for your platform... using builtin-java classes where applicable
>  INFO [2016-07-11 15:03:41,703] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Changing view acls to: root
>  INFO [2016-07-11 15:03:41,704] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Changing modify acls to: root
>  INFO [2016-07-11 15:03:41,708] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - SecurityManager: authentication disabled; ui
> acls disabled; users with view permissions: Set(root); users with modify
> permissions: Set(root)
>  INFO [2016-07-11 15:03:41,977] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Starting HTTP Server
>  INFO [2016-07-11 15:03:42,029] ({pool-2-thread-5}
> Server.java[doStart]:272) - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 15:03:42,047] ({pool-2-thread-5}
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0
> :53313
>  INFO [2016-07-11 15:03:42,048] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Successfully started service 'HTTP class
> server' on port 53313.
> * INFO [2016-07-11 15:03:43,978] ({pool-2-thread-5}
> SparkInterpreter.java[createSparkContext]:233) - -- Create new
> SparkContext mesos://zk://master1:2181,master2:2181,master3:2181/mesos
> ---*
>  INFO [2016-07-11 15:03:44,003] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Running Spark version 1.6.1
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Changing view acls to: root
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Changing modify acls to: root
>  INFO [2016-07-11 15:03:44,037] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - SecurityManager: authentication disabled; ui
> acls disabled; users with view permissions: Set(root); users with modify
> permissions: Set(root)
>  INFO [2016-07-11 15:03:44,231] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Successfully started service 'sparkDriver' on
> port 33913.
>  INFO [2016-07-11 15:03:44,552]
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4}
> Slf4jLogger.scala[applyOrElse]:80) - Slf4jLogger started
>  INFO [2016-07-11 15:03:44,597]
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4}
> Slf4jLogger.scala[apply$mcV$sp]:74) - Starting remoting
>  INFO [2016-07-11 15:03:44,754]
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4}
> Slf4jLogger.scala[apply$mcV$sp]:74) - Remoting started; listening on
> addresses :[akka.tcp://sparkDriverActorSystem@10.197.0.3:55213]
>  INFO [2016-07-11 15:03:44,760] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Successfully started service
> 'sparkDriverActorSystem' on port 55213.
>  INFO [2016-07-11 15:03:44,771] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Registering MapOutputTracker
>  INFO [2016-07-11 15:03:44,789] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Registering BlockManagerMaster
>  INFO [2016-07-11 15:03:44,802] 

Re: StreamingKmeans Spark doesn't work at all

2016-07-11 Thread Biplob Biswas
Hi Shuai,

Thanks for the reply. I mentioned in the mail that I tried running the
Scala example as well, from the link I provided, and the result is the same.
Thanks & Regards
Biplob Biswas

On Mon, Jul 11, 2016 at 5:52 AM, Shuai Lin  wrote:

> I would suggest you run the scala version of the example first, so you can
> tell whether it's a problem of the data you provided or a problem of the
> java code.
>
> On Mon, Jul 11, 2016 at 2:37 AM, Biplob Biswas 
> wrote:
>
>> Hi,
>>
>> I know I am asking again, but I tried running the same thing on a Mac as
>> well, since some answers on the internet suggested it could be an issue with
>> the Windows environment, but still nothing works.
>>
>> Can anyone at least suggest whether it's a bug with Spark or whether it is
>> something else?
>>
>> Would be really grateful! Thanks a lot.
>>
>> Thanks & Regards
>> Biplob Biswas
>>
>> On Thu, Jul 7, 2016 at 5:21 PM, Biplob Biswas 
>> wrote:
>>
>>> Hi,
>>>
>>> Can anyone care to please look into this issue?  I would really love
>>> some assistance here.
>>>
>>> Thanks a lot.
>>>
>>> Thanks & Regards
>>> Biplob Biswas
>>>
>>> On Tue, Jul 5, 2016 at 1:00 PM, Biplob Biswas 
>>> wrote:
>>>

 Hi,

 I implemented the StreamingKMeans example provided on the Spark website,
 but in Java.
 The full implementation is here,

 http://pastebin.com/CJQfWNvk

 But I am not getting anything in the output except occasional timestamps
 like the one below:

 ---
 Time: 1466176935000 ms
 ---

 Also, i have 2 directories:
 "D:\spark\streaming example\Data Sets\training"
 "D:\spark\streaming example\Data Sets\test"

 and inside these directories i have 1 file each
 "samplegpsdata_train.txt"
 and "samplegpsdata_test.txt" with training data having 500 datapoints
 and
 test data with 60 datapoints.

 I am very new to Spark and any help is highly appreciated.


 //---//

 Now, I have also tried using the Scala implementation available
 here:

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala


 and even had the training and test file provided in the format
 specified in
 that file as follows:

  * The rows of the training text files must be vector data in the form
  * `[x1,x2,x3,...,xn]`
  * Where n is the number of dimensions.
  *
  * The rows of the test text files must be labeled data in the form
  * `(y,[x1,x2,x3,...,xn])`
  * Where y is some identifier. n must be the same for train and test.


 But I still get no output in my Eclipse window ... just the Time!

 Can anyone seriously help me with this?

 Thank you so much
 Biplob Biswas



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/StreamingKmeans-Spark-doesn-t-work-at-all-tp27286.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>
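
For reference, a minimal Scala sketch of the intended setup (paths, k and the dimension
are placeholders). One common cause of empty batches is that textFileStream only picks
up files that are moved into the monitored directories after the StreamingContext has
started, so files that already sit there when the job starts are silently ignored:

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeansSketch")
val ssc = new StreamingContext(conf, Seconds(10))

// training rows look like [x1,x2,...,xn]; test rows look like (y,[x1,x2,...,xn])
val trainingData = ssc.textFileStream("D:/spark/training").map(Vectors.parse)
val testData = ssc.textFileStream("D:/spark/test").map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(2)                    // number of clusters (placeholder)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)   // the dimension must match the number of features

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()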


Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mich Talebzadeh
Hi Mark,


Hm. It should work. This is Spark 1.6.1 on Oracle 12c





scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

HiveContext: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@70f446c



scala> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"

_ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12



scala> var _username : String = "sh"

_username: String = sh



scala> var _password : String = ""

_password: String = sh



scala> val c = HiveContext.load("jdbc",

 | Map("url" -> _ORACLEserver,

 | "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID,
CHANNEL_DESC FROM sh.channels)",

 | "user" -> _username,

 | "password" -> _password))

warning: there were 1 deprecation warning(s); re-run with -deprecation for
details

c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: string, CHANNEL_DESC:
string]



scala> c.registerTempTable("t_c")



scala> c.count

res2: Long = 5



scala> HiveContext.sql("select * from t_c").collect.foreach(println)

[3,Direct Sales]

[9,Tele Sales]

[5,Catalog]

[4,Internet]

[2,Partners]


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 08:25, Mark Vervuurt  wrote:

> Hi Mich,
>
> sorry for bothering did you manage to solve your problem? We have a
> similar problem with Spark 1.5.2 using a JDBC connection with a DataFrame
> to an Oracle Database.
>
> Thanks,
> Mark
>
> On 12 Feb 2016, at 11:45, Mich Talebzadeh  wrote:
>
> Hi,
>
> I use the following to connect to Oracle DB from Spark shell 1.5.2
>
> spark-shell --master spark://50.140.197.217:7077 --driver-class-path
> /home/hduser/jars/ojdbc6.jar
>
> in Scala I do
>
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext: org.apache.spark.sql.SQLContext =
> org.apache.spark.sql.SQLContext@f9d4387
>
> scala> val channels = sqlContext.read.format("jdbc").options(
>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>  |  "dbtable" -> "(select * from sh.channels where channel_id =
> 14)",
>  |  "user" -> "sh",
>  |   "password" -> "xxx")).load
> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127),
> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID:
> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>
> *scala> channels.count()*
>
> But the latter command keeps hanging?
>
> Any ideas appreciated
>
> Thanks,
>
> Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
> Met vriendelijke groet | Best regards,
> ___
>
>
> *Ir. Mark Vervuurt*
> Senior Big Data Scientist | Insights & Data
>
> Capgemini Nederland | Utrecht
> Tel.: +31 30 6890978 – Mob.: +31653670390
> www.capgemini.com
>
>  *People matter, results count.*
>
> __
>
>
>


Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Hi Mich,

Sorry for bothering you, but did you manage to solve your problem? We have a similar 
problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an Oracle 
database.

Thanks,
Mark

> On 12 Feb 2016, at 11:45, Mich Talebzadeh  > wrote:
> 
> Hi,
>  
> I use the following to connect to Oracle DB from Spark shell 1.5.2
>  
> spark-shell --master spark://50.140.197.217:7077 
>  --driver-class-path /home/hduser/jars/ojdbc6.jar
>  
> in Scala I do
>  
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext: org.apache.spark.sql.SQLContext = 
> org.apache.spark.sql.SQLContext@f9d4387
>  
> scala> val channels = sqlContext.read.format("jdbc").options(
>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>  |  "dbtable" -> "(select * from sh.channels where channel_id = 14)",
>  |  "user" -> "sh",
>  |   "password" -> "xxx")).load
> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>  
> scala> channels.count()
>  
> But the latter command keeps hanging?
>  
> Any ideas appreciated
>  
> Thanks,
>  
> Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.

Met vriendelijke groet | Best regards,
___

Ir. Mark Vervuurt
Senior Big Data Scientist | Insights & Data

Capgemini Nederland | Utrecht
Tel.: +31 30 6890978 – Mob.: +31653670390
www.capgemini.com 
 People matter, results count.
__





Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-11 Thread ayan guha
Hi

When you say "Zeppelin and STS", I am assuming you mean "Spark Interpreter"
and "JDBC interpreter" respectively.

Through Zeppelin, you can either run your own Spark application (by using
Zeppelin's own Spark context) via the Spark interpreter, OR you can access
STS, which is a separate Spark application (i.e. a separate Spark context), via the JDBC
interpreter. There should not be any need for these 2 contexts to coexist.

If you want to share data, save it to Hive from either context, and you
should be able to see the data from the other context, for example as in the
sketch below.
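
For example, a rough sketch (the database and table names are illustrative, and this
assumes Zeppelin's sqlContext is a Hive-enabled context):

// in a Zeppelin %spark paragraph: persist whatever you computed as a Hive table
val df = sqlContext.sql("SELECT id, count(*) AS cnt FROM some_source GROUP BY id")
df.write.mode("overwrite").saveAsTable("shared_db.my_result")

// in a %jdbc paragraph pointed at the Spark Thrift Server (or any JDBC client):
//   SELECT * FROM shared_db.my_result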

Best
Ayan



On Mon, Jul 11, 2016 at 3:00 PM, Chanh Le  wrote:

> Hi Ayan,
> I tested It works fine but one more confuse is If my (technical) users
> want to write some code in zeppelin to apply thing into Hive table?
> Zeppelin and STS can’t share Spark Context that mean we need separated
> process? Is there anyway to use the same Spark Context of STS?
>
> Regards,
> Chanh
>
>
> On Jul 11, 2016, at 10:05 AM, Takeshi Yamamuro 
> wrote:
>
> Hi,
>
> ISTM multiple sparkcontexts are not recommended in spark.
> See: https://issues.apache.org/jira/browse/SPARK-2243
>
> // maropu
>
>
> On Mon, Jul 11, 2016 at 12:01 PM, ayan guha  wrote:
>
>> Hi
>>
>> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on
>> YARN for few months now without much issue.
>>
>> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le  wrote:
>>
>>> Hi everybody,
>>> We are using Spark to query big data and currently we’re using Zeppelin
>>> to provide a UI for technical users.
>>> Now we also need to provide a UI for business users so we use Oracle BI
>>> tools and set up a Spark Thrift Server (STS) for it.
>>>
>>> When I run both Zeppelin and STS, I get this error:
>>>
>>> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4}
>>> SchedulerFactory.java[jobStarted]:131) - Job
>>> remoteInterpretJob_1468204821905 started by scheduler
>>> org.apache.zeppelin.spark.SparkInterpreter835015739
>>>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - Changing view acls to: giaosudau
>>>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - Changing modify acls to: giaosudau
>>>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - SecurityManager: authentication disabled; ui
>>> acls disabled; users with view permissions: Set(giaosudau); users with
>>> modify permissions: Set(giaosudau)
>>>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - Starting HTTP Server
>>>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4}
>>> Server.java[doStart]:272) - jetty-8.y.z-SNAPSHOT
>>>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4}
>>> AbstractConnector.java[doStart]:338) - Started
>>> SocketConnector@0.0.0.0:54818
>>>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - Successfully started service 'HTTP class
>>> server' on port 54818.
>>>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4}
>>> SparkInterpreter.java[createSparkContext]:233) - -- Create new
>>> SparkContext local[*] ---
>>>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4}
>>> Logging.scala[logWarning]:70) - Another SparkContext is being constructed
>>> (or threw an exception in its constructor).  This may indicate an error,
>>> since only one SparkContext may be running in this JVM (see SPARK-2243).
>>> The other SparkContext was created at:
>>>
>>> Does that mean I need to set up / allow multiple contexts? Because this is only
>>> a test in local mode; if I deploy on a Mesos cluster, what would
>>> happen?
>>>
>>> I need you guys to suggest some solutions for that. Thanks.
>>>
>>> Chanh
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>


-- 
Best Regards,
Ayan Guha