Re: Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Yash Sharma
This answers exactly what you are looking for -

http://stackoverflow.com/a/34204640/1562474
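
For reference, a sketch of one common DataFrame approach (not necessarily what the
linked answer does; it assumes the df from the question below with id: String and
words: Array[String], and that collect_list/collect_set are available, e.g. via a
HiveContext on 1.x):

import org.apache.spark.sql.functions.{explode, collect_list, collect_set}

// Explode each array into one row per word, then re-aggregate per id.
val exploded = df.select(df("id"), explode(df("words")).as("word"))
val allWords      = exploded.groupBy("id").agg(collect_list("word").as("all_words"))
val distinctWords = exploded.groupBy("id").agg(collect_set("word").as("distinct_words"))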

On Tue, Jul 12, 2016 at 6:40 AM, Pedro Rodriguez 
wrote:

> Is it possible with Spark SQL to merge columns whose types are Arrays or
> Sets?
>
> My use case would be something like this:
>
> DF types
> id: String
> words: Array[String]
>
> I would want to do something like
>
> df.groupBy('id).agg(merge_arrays('words)) -> list of all words
> df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Yash Sharma
Spark is more of an execution engine than a database. Hive is a data
warehouse, though I still tend to treat it as an execution engine.

For databases, you could compare HBase and Cassandra, as they both have very
wide usage and proven performance. We have used Cassandra in the past and
were very happy with the results. You should move this discussion to the
Cassandra/HBase mailing lists for better advice.

Cheers

On Tue, Jul 12, 2016 at 3:23 PM, ayan guha  wrote:

> HI
>
> HBase is pretty neat itself. But speed is not the criterion for choosing HBase
> over Cassandra (or vice versa). Slowness can very well be because of design
> issues, and unfortunately changing technology will not help in that case
> :)
>
> I would suggest you quantify the "slow"-ness in conjunction
> with the infrastructure you have, and I am sure good people here will help.
>
> Best
> Ayan
>
> On Tue, Jul 12, 2016 at 3:01 PM, Ashok Kumar wrote:
>
>> Anyone in Spark as well?
>>
>> My colleague has been using Cassandra. However, he says it is too slow
>> and not user friendly.
>> MongoDB as a document database is pretty neat but not fast enough.
>>
>> My main concern is fast writes per second and good scaling.
>>
>>
>> Hive on Spark or Tez?
>>
>> How about HBase, or anything else?
>>
>> Any expert advice warmly acknowledged.
>>
>> thanking you
>>
>>
>> On Monday, 11 July 2016, 17:24, Ashok Kumar  wrote:
>>
>>
>> Hi Gurus,
>>
>> Advice appreciated from Hive gurus.
>>
>> My colleague has been using Cassandra. However, he says it is too slow
>> and not user friendly.
>> MongoDB as a document database is pretty neat but not fast enough.
>>
>> My main concern is fast writes per second and good scaling.
>>
>>
>> Hive on Spark or Tez?
>>
>> How about HBase, or anything else?
>>
>> Any expert advice warmly acknowledged.
>>
>> thanking you
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark cluster tuning recommendation

2016-07-11 Thread Yash Sharma
I would say use dynamic allocation rather than a fixed number of executors.
Provide whatever executor memory you would like.
Deciding on the values requires a couple of test runs and checking what works
best for you.

You could try something like -

--driver-memory 1G \
--executor-memory 2G \
--executor-cores 2 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=8 \
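
One assumption worth checking for your setup: dynamic allocation relies on the
external shuffle service, so --conf spark.shuffle.service.enabled=true is usually
needed alongside the flags above.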



On Tue, Jul 12, 2016 at 1:27 PM, Anuj Kumar  wrote:

> That configuration looks bad, with only two cores in use and 1 GB used by
> the app. A few points-
>
> 1. Please oversubscribe those CPUs to at least twice the number of cores
> you have to start with, and then tune if it freezes
> 2. Allocate all of the CPU cores and memory to your running app (I assume
> it is your test environment)
> 3. Assuming that you are running quad-core machines, if you define cores
> as 8 for your workers you will get 56 cores (CPU threads) across the 7 workers
> 4. Also, it depends on the source from where you are reading the data. If
> you are reading from HDFS, what is your block size and part count?
> 5. You may also have to tune the timeouts and frame-size based on the
> dataset and errors that you are facing
>
> We have run terasort with a couple of high-end worker machines reading/writing
> from HDFS, with 5-10 mount points allocated for HDFS and Spark local storage.
> We have used multiple configurations, like
> 10 workers with 10 CPUs/10 GB each, or 25 workers with 6 CPUs/6 GB each, running
> on each of the two machines with HDFS 512 MB blocks and 1000-2000 parts. All of
> this chatting over 10 GbE worked well.
>
> On Tue, Jul 12, 2016 at 3:39 AM, Kartik Mathur 
> wrote:
>
>> I am trying to run terasort in Spark, on a 7-node cluster with only 10 GB
>> of data, and executors get lost with a GC overhead limit exceeded error.
>>
>> This is what my cluster looks like -
>>
>>
>>- *Alive Workers:* 7
>>- *Cores in use:* 28 Total, 2 Used
>>- *Memory in use:* 56.0 GB Total, 1024.0 MB Used
>>- *Applications:* 1 Running, 6 Completed
>>- *Drivers:* 0 Running, 0 Completed
>>- *Status:* ALIVE
>>
>> Each worker has 8 cores and 4GB memory.
>>
>> My question is how do people running in production decide these
>> properties -
>>
>> 1) --num-executors
>> 2) --executor-cores
>> 3) --executor-memory
>> 4) num of partitions
>> 5) spark.default.parallelism
>>
>> Thanks,
>> Kartik
>>
>>
>>
>


Fwd: Fast database with writes per second and horizontal scaling

2016-07-11 Thread ayan guha
HI

HBase is pretty neat itself. But speed is not the criterion for choosing HBase
over Cassandra (or vice versa). Slowness can very well be because of design
issues, and unfortunately changing technology will not help in that case
:)

I would suggest you quantify the "slow"-ness in conjunction
with the infrastructure you have, and I am sure good people here will help.

Best
Ayan

On Tue, Jul 12, 2016 at 3:01 PM, Ashok Kumar 
wrote:

> Anyone in Spark as well?
>
> My colleague has been using Cassandra. However, he says it is too slow
> and not user friendly.
> MongoDB as a document database is pretty neat but not fast enough.
>
> My main concern is fast writes per second and good scaling.
>
>
> Hive on Spark or Tez?
>
> How about HBase, or anything else?
>
> Any expert advice warmly acknowledged.
>
> thanking you
>
>
> On Monday, 11 July 2016, 17:24, Ashok Kumar  wrote:
>
>
> Hi Gurus,
>
> Advice appreciated from Hive gurus.
>
> My colleague has been using Cassandra. However, he says it is too slow
> and not user friendly.
> MongoDB as a document database is pretty neat but not fast enough.
>
> My main concern is fast writes per second and good scaling.
>
>
> Hive on Spark or Tez?
>
> How about HBase, or anything else?
>
> Any expert advice warmly acknowledged.
>
> thanking you
>
>
>


-- 
Best Regards,
Ayan Guha


Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Ashok Kumar
Anyone in Spark as well?
My colleague has been using Cassandra. However, he says it is too slow and not
user friendly. MongoDB as a document database is pretty neat but not fast enough.
My main concern is fast writes per second and good scaling.

Hive on Spark or Tez?
How about HBase, or anything else?
Any expert advice warmly acknowledged.
thanking you

On Monday, 11 July 2016, 17:24, Ashok Kumar  wrote:
 

 Hi Gurus,
Advice appreciated from Hive gurus.
My colleague has been using Cassandra. However, he says it is too slow and not
user friendly. MongoDB as a document database is pretty neat but not fast enough.
My main concern is fast writes per second and good scaling.

Hive on Spark or Tez?
How about HBase, or anything else?
Any expert advice warmly acknowledged.
thanking you

  

Re: Spark cluster tuning recommendation

2016-07-11 Thread Anuj Kumar
That configuration looks bad, with only two cores in use and 1 GB used by
the app. A few points-

1. Please oversubscribe those CPUs to at least twice the number of cores
you have to start with, and then tune if it freezes
2. Allocate all of the CPU cores and memory to your running app (I assume
it is your test environment)
3. Assuming that you are running quad-core machines, if you define cores as
8 for your workers you will get 56 cores (CPU threads) across the 7 workers
4. Also, it depends on the source from where you are reading the data. If
you are reading from HDFS, what is your block size and part count?
5. You may also have to tune the timeouts and frame-size based on the
dataset and errors that you are facing

We have run terasort with a couple of high-end worker machines reading/writing
from HDFS, with 5-10 mount points allocated for HDFS and Spark local storage. We
have used multiple configurations, like
10 workers with 10 CPUs/10 GB each, or 25 workers with 6 CPUs/6 GB each, running
on each of the two machines with HDFS 512 MB blocks and 1000-2000 parts. All of
this chatting over 10 GbE worked well.
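
To make that concrete, here is a rough sketch (hypothetical numbers for 7 workers
with 8 cores / 4 GB each; treat it as a starting point to tune, not a recipe):

import org.apache.spark.{SparkConf, SparkContext}

// Two executors per worker, leaving headroom for the OS and daemons.
val conf = new SparkConf()
  .setAppName("terasort-tuning-sketch")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "1536m")
  .set("spark.default.parallelism", "112") // roughly 2-4 tasks per core in use
val sc = new SparkContext(conf)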

On Tue, Jul 12, 2016 at 3:39 AM, Kartik Mathur  wrote:

> I am trying to run terasort in Spark, on a 7-node cluster with only 10 GB
> of data, and executors get lost with a GC overhead limit exceeded error.
>
> This is what my cluster looks like -
>
>
>- *Alive Workers:* 7
>- *Cores in use:* 28 Total, 2 Used
>- *Memory in use:* 56.0 GB Total, 1024.0 MB Used
>- *Applications:* 1 Running, 6 Completed
>- *Drivers:* 0 Running, 0 Completed
>- *Status:* ALIVE
>
> Each worker has 8 cores and 4GB memory.
>
> My question is how do people running in production decide these
> properties -
>
> 1) --num-executors
> 2) --executor-cores
> 3) --executor-memory
> 4) num of partitions
> 5) spark.default.parallelism
>
> Thanks,
> Kartik
>
>
>


Complications with saving Kafka offsets?

2016-07-11 Thread BradleyUM
I'm working on a Spark Streaming (1.6.0) project and one of our requirements
is to persist Kafka offsets to Zookeeper after a batch has completed so that
we can restart work from the correct position if we have to restart the
process for any reason. Many links,
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
included, seem to suggest that calling transform() on the stream is a
perfectly acceptable way to store the offsets off for processing when the
batch completes. Since that method seems to offer more intuitive ordering
guarantees than foreachRDD() we have, up until now, preferred it. So our
code looks something like the following:

AtomicReference<OffsetRange[]> savedOffsets = new AtomicReference<>();

messages = messages.transformToPair((rdd) -> {
  // Save the offsets so that we can update ZK with them later
  HasOffsetRanges hasOffsetRanges = (HasOffsetRanges) rdd.rdd();
  savedOffsets.set(hasOffsetRanges.offsetRanges());
  return rdd; // transform must pass the RDD through unchanged
});

Unfortunately we've discovered that this doesn't work, as contrary to
expectations the logic inside of transformToPair() seems to run whenever a
new batch gets added, even if we're not prepared to process it yet. So
savedOffsets will store the offsets of the most recently enqueued batch, not
necessarily the one being processed. When a batch completes, then, the
offset we save to ZK may reflect enqueued data that we haven't actually
processed yet. This can (and has) created conditions where a crash causes us
to restart from the wrong position and drop data.

There seem to be two solutions to this, from what I can tell:

1.) A brief test using foreachRDD() instead of transform() seems to behave
more in line with expectations, with the call only being made when a batch
actually begins to process (a sketch follows below this list). I have yet to
find an explanation as to why the two methods differ in this way.
2.) Instead of using an AtomicReference we tried a queue of offsets. Our
logic pushes a set of offsets at the start of a batch and pulls off the
oldest at the end - the idea is that the one being pulled will always
reflect the batch most recently processed, not a newer one still in the queue.
Since we're not 100% sure whether Spark guarantees this, we also have logic to
assert that the batch that was completed has the same RDD ID as the one we're
pulling from the queue.
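
A minimal sketch of option 1 in Scala (our real code is Java; the stream name
directKafkaStream is hypothetical, and the cast is the documented HasOffsetRanges
pattern):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.foreachRDD { rdd =>
  // Runs when this batch is actually processed, so the ranges match the batch.
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // then persist offsetRanges to ZooKeeper once the batch's work has completed
}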

However, I have yet to find anything, on this list or elsewhere, that
suggests that either of these two approaches is necessary. Does what I've
described match anyone else's experience? Is the behavior I'm seeing from
the transform() method expected? Do both of the solutions I've proposed seem
legitimate, or is there some complication that I've failed to account for?

Any help is appreciated.

- Bradley



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Complications-with-saving-Kafka-offsets-tp27324.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread ayan guha
Hi Mich

Thanks for showing examples, makes perfect sense.

One question: "...I agree that on VLT (very large tables), the limitation
in available memory may be the overriding factor in using Spark"... have you
observed any specific threshold for VLT which tilts the balance against
Spark? For example, if I have a 10-node cluster with (say) 64G RAM and
8 CPUs per node, where should I expect Spark to crumble? What if my nodes have
128G RAM?

I know it's difficult to answer with empirical values and YMMV depending
on cluster load, data format, query, etc. But is there a guesstimate around?

Best
Ayan

On Tue, Jul 12, 2016 at 9:22 AM, Mich Talebzadeh 
wrote:

> Another point with Hive on Spark and Hive on Tez + LLAP; I am thinking out
> loud :)
>
>
>    1. I am using Hive on Spark and I have a table of 10GB, say, with 100
>    users concurrently accessing the same partition of an ORC table (last one
>    hour or so)
>    2. Spark takes data and puts it in memory. I gather only data for that
>    partition will be loaded for 100 users. In other words there will be 100
>    copies.
>    3. Spark, unlike an RDBMS, does not have the notion of a hot cache or Most
>    Recently Used (MRU) / Least Recently Used (LRU). So once the user finishes,
>    the data is released from Spark memory. The next user will load that data
>    again. Potentially this is somewhat wasteful of resources?
>    4. With Tez we only have a DAG. It is MR with a DAG. So the same algorithm
>    will be applied to 100 user sessions, but without the memory usage
>    5. If I add LLAP, will that be more efficient in terms of memory usage
>    compared to Hive or not? Will it keep the data in memory for reuse or not?
>    6. What I don't understand is what makes Tez and LLAP more efficient
>    compared to Spark!
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 July 2016 at 21:54, Mich Talebzadeh 
> wrote:
>
>> In my test I did like for like, keeping the setup the same, namely:
>>
>>
>>1. Table was a parquet table of 100 Million rows
>>2. The same set up was used for both Hive on Spark and Hive on MR
>>3. Spark was very impressive compared to MR on this particular test.
>>
>>
>> Just to see any issues, I created an ORC table in the image of the Parquet one
>> (insert/select from Parquet to ORC) with stats updated for columns etc.
>>
>> These were the results of the same run using ORC table this time:
>>
>> hive> select max(id) from oraclehadoop.dummy;
>>
>> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
>> Query Hive on Spark job[1] stages:
>> 2
>> 3
>> Status: Running (Hive on Spark job[1])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId:
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>> [StageCost]
>> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
>> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
>> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
>> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
>> Finished
>> Status: Finished successfully in 16.08 seconds
>> OK
>> 1
>> Time taken: 17.775 seconds, Fetched: 1 row(s)
>>
>> Repeat with MR engine
>>
>> hive> set hive.execution.engine=mr;
>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
>> versions. Consider using a different execution engine (i.e. spark, tez) or
>> using Hive 1.X releases.
>>
>> hive> select max(id) from oraclehadoop.dummy;
>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
>> the future versions. Consider using a different execution engine (i.e.
>> spark, tez) or using Hive 1.X releases.
>> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
>> Total jobs = 1
>> Launching Job 1 out of 1
>> Number of reduce tasks determined at compile time: 1
>> In order to change the average load 

Re: Batch details are missing

2016-07-11 Thread C. Josephson
The solution ended up being upgrading from Spark 1.5 to Spark 1.6.1+

On Fri, Jun 24, 2016 at 2:57 PM, C. Josephson  wrote:

> We're trying to resolve some performance issues with spark streaming using
> the application UI, but the batch details page doesn't seem to be working.
> When I click on a batch in the streaming application UI, I expect to see
> something like this: http://i.stack.imgur.com/ApF8z.png
>
> But instead we see this:
> [image: Inline image 1]
>
> Any ideas why we aren't getting any job details? We are running pySpark
> 1.5.0.
>
> Thanks,
> -cjoseph
>



-- 
Colleen Josephson
Engineering Researcher
Uhana, Inc.


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Another point with Hive on Spark and Hive on Tez + LLAP; I am thinking out loud
:)


   1. I am using Hive on Spark and I have a table of 10GB, say, with 100
   users concurrently accessing the same partition of an ORC table (last one
   hour or so)
   2. Spark takes data and puts it in memory. I gather only data for that
   partition will be loaded for 100 users. In other words there will be 100
   copies.
   3. Spark, unlike an RDBMS, does not have the notion of a hot cache or Most
   Recently Used (MRU) / Least Recently Used (LRU). So once the user finishes,
   the data is released from Spark memory. The next user will load that data
   again. Potentially this is somewhat wasteful of resources?
   4. With Tez we only have a DAG. It is MR with a DAG. So the same algorithm
   will be applied to 100 user sessions, but without the memory usage
   5. If I add LLAP, will that be more efficient in terms of memory usage
   compared to Hive or not? Will it keep the data in memory for reuse or not?
   6. What I don't understand is what makes Tez and LLAP more efficient
   compared to Spark!
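
As a side note on point 3: within a single Spark application, data can be pinned
explicitly and reused across queries (a sketch, assuming a HiveContext/SQLContext
named sqlContext; whether Hive on Spark shares this across separate user sessions
is exactly the open question):

// Pin the table in memory once, then later queries read the cached copy.
sqlContext.sql("CACHE TABLE oraclehadoop.dummy")
val mx = sqlContext.sql("SELECT max(id) FROM oraclehadoop.dummy")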

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 21:54, Mich Talebzadeh  wrote:

> In my test I did like for like, keeping the setup the same, namely:
>
>
>1. Table was a parquet table of 100 Million rows
>2. The same set up was used for both Hive on Spark and Hive on MR
>3. Spark was very impressive compared to MR on this particular test.
>
>
> Just to see any issues, I created an ORC table in the image of the Parquet one
> (insert/select from Parquet to ORC) with stats updated for columns etc.
>
> These were the results of the same run using ORC table this time:
>
> hive> select max(id) from oraclehadoop.dummy;
>
> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 16.08 seconds
> OK
> 1
> Time taken: 17.775 seconds, Fetched: 1 row(s)
>
> Repeat with MR engine
>
> hive> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
> versions. Consider using a different execution engine (i.e. spark, tez) or
> using Hive 1.X releases.
>
> hive> select max(id) from oraclehadoop.dummy;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
> the future versions. Consider using a different execution engine (i.e.
> spark, tez) or using Hive 1.X releases.
> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> Starting Job = job_1468226887011_0008, Tracking URL =
> http://rhes564:8088/proxy/application_1468226887011_0008/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1468226887011_0008
> Hadoop job information for Stage-1: number of mappers: 23; number of
> reducers: 1
> 2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
> 2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
> 16.48 sec
> 2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
> 40.63 sec
> 2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, 

chisqSelector in Python

2016-07-11 Thread Tobi Bosede
Hi all,

There is no Python example for ChiSqSelector at the link below.
https://spark.apache.org/docs/1.4.1/mllib-feature-extraction.html#chisqselector

So I am converting the Scala code to Python. I "translated" the following
code

val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 }))
}

as:
discretizedData = data.map(lambda lp: LabeledPoint(lp.label,
    Vectors.dense(np.array(lp.features).map(lambda x: x / 16))))

However when I call selector.fit(discretizedData) I get this error. Any
thoughts on the problem? Thanks.

Py4JJavaError: An error occurred while calling o2184.fitChiSqSelector.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 158.0 failed 4 times, most recent failure: Lost task
0.3 in stage 158.0 (TID 3078, node032.hadoop.cls04):
java.net.SocketException: Connection reset


Re: Error starting thrift server on Spark

2016-07-11 Thread Jacek Laskowski
Create the directory and start over. You've got history server enabled.
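(For context: the FileNotFoundException below comes from event logging. With
spark.eventLog.enabled=true, the EventLoggingListener writes to spark.eventLog.dir,
which defaults to file:/tmp/spark-events, so that directory has to exist or event
logging has to be disabled.)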

Jacek
On 11 Jul 2016 11:07 p.m., "Marco Colombo" 
wrote:

Hi all, I cannot start thrift server on spark 1.6.2
I've configured binding port and IP and left default metastore.
In logs I get:

16/07/11 22:51:46 INFO NettyBlockTransferService: Server created on 46717
16/07/11 22:51:46 INFO BlockManagerMaster: Trying to register BlockManager
16/07/11 22:51:46 INFO BlockManagerMasterEndpoint: Registering block
manager 10.0.2.15:46717 with 511.1 MB RAM, BlockManagerId(driver,
10.0.2.15, 46717)
16/07/11 22:51:46 INFO BlockManagerMaster: Registered BlockManager
16/07/11 22:51:46 INFO AppClient$ClientEndpoint: Executor updated:
app-20160711225146-/0 is now RUNNING
16/07/11 22:51:47 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:76)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/07/11 22:51:47 INFO SparkUI: Stopped Spark web UI at http://localdev:4040

Did anyone found a similar issue? Any suggestion on the root cause?

Thanks to all!


QuantileDiscretizer not working properly with big dataframes

2016-07-11 Thread Pasquinell Urbani
Hi all,

We have a dataframe with 2.5 million records and 13 features. We want
to perform a logistic regression with this data, but first we need to divide
each column into discrete values using QuantileDiscretizer. This will
improve the performance of the model by reducing the impact of outliers.

For small dataframes QuantileDiscretizer works perfectly (see the ml example:
https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
but for large data frames it tends to split the column into only the values 0
and 1 (despite the number of buckets being set to 5). Here is my
code:

val discretizer = new QuantileDiscretizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setNumBuckets(5)

val result = discretizer.fit(df3).transform(df3)
result.show()

I found the same problem presented here:
https://issues.apache.org/jira/browse/SPARK-13444 . But there is no
solution yet.

Am I configuring the function in a bad way? Should I pre-process the
data (e.g. z-scores)? Can somebody help me deal with this?
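
A possible workaround sketch on 1.6 (an assumption, not a confirmed fix for
SPARK-13444): compute the split points over the full column and feed them to
Bucketizer, so the cuts do not depend on QuantileDiscretizer's sampling. It
assumes C4 is a Double column and keeps the 5-bucket setup:

import org.apache.spark.ml.feature.Bucketizer

val sorted = df3.select("C4").rdd.map(_.getDouble(0)).sortBy(identity).zipWithIndex()
val n = sorted.count()
val probs = Array(0.2, 0.4, 0.6, 0.8)
val cutIdx = probs.map(p => (p * (n - 1)).toLong).toSet
val cuts = sorted.filter { case (_, i) => cutIdx.contains(i) }.map(_._1).collect().sorted
// Splits must be strictly increasing; duplicates collapse buckets for skewed data.
val splits = (Double.NegativeInfinity +: cuts :+ Double.PositiveInfinity).distinct

val bucketizer = new Bucketizer().setInputCol("C4").setOutputCol("C4_Q").setSplits(splits)
val result = bucketizer.transform(df3)
result.show()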

Regards


Re: Spark hangs at "Removed broadcast_*"

2016-07-11 Thread dhruve ashar
Hi,

Can you check the time when the job actually finished from the logs. The
logs provided are too short and do not reveal meaningful information.



On Mon, Jul 11, 2016 at 9:50 AM, velvetbaldmime  wrote:

> Spark 2.0.0-preview
>
> We've got an app that uses a fairly big broadcast variable. We run this on
> a
> big EC2 instance, so deployment is in client-mode. Broadcasted variable is
> a
> massive Map[String, Array[String]].
>
> At the end of saveAsTextFile, the output in the folder seems to be complete
> and correct (apart from .crc files still being there) BUT the spark-submit
> process is stuck on, seemingly, removing the broadcast variable. The stuck
> logs look like this: http://pastebin.com/wpTqvArY
>
> My last run lasted for 12 hours after doing saveAsTextFile - just
> sitting there. I did a jstack on driver process, most threads are parked:
> http://pastebin.com/E29JKVT7
>
> Full store: We used this code with Spark 1.5.0 and it worked, but then the
> data changed and something stopped fitting into Kryo's serialisation
> buffer.
> Increasing it didn't help, so I had to disable the KryoSerialiser. Tested
> it
> again - it hung. Switched to 2.0.0-preview - seems like the same issue.
>
> I'm not quite sure what's even going on given that there's almost no CPU
> activity and no output in the logs, yet the output is not finalised like it
> used to before.
>
> Would appreciate any help, thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
-Dhruve Ashar


Spark cluster tuning recommendation

2016-07-11 Thread Kartik Mathur
I am trying to run terasort in Spark, on a 7-node cluster with only 10 GB of
data, and executors get lost with a GC overhead limit exceeded error.

This is what my cluster looks like -


   - *Alive Workers:* 7
   - *Cores in use:* 28 Total, 2 Used
   - *Memory in use:* 56.0 GB Total, 1024.0 MB Used
   - *Applications:* 1 Running, 6 Completed
   - *Drivers:* 0 Running, 0 Completed
   - *Status:* ALIVE

Each worker has 8 cores and 4GB memory.

My question is how do people running in production decide these properties
-

1) --num-executors
2) --executor-cores
3) --executor-memory
4) num of partitions
5) spark.default.parallelism

Thanks,
Kartik


/spark-ec2 script: trouble using ganglia web ui spark 1.6.1

2016-07-11 Thread Andy Davidson
I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 script.
The output shows Ganglia started, however I am not able to access
http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:5080/ganglia. I
have tried using the private IP from within my data center.



I do not see anything listening on port 5080.



Is there some additional step or configuration? (my AWS firewall knowledge
is limited)



Thanks



Andy





$ grep ganglia src/main/resources/scripts/launchCluster.sh.out

Initializing ganglia

[timing] ganglia init:  00h 00m 01s

Configuring /etc/ganglia/gmond.conf

Configuring /etc/ganglia/gmetad.conf

Configuring /etc/httpd/conf.d/ganglia.conf

Setting up ganglia

RSYNC'ing /etc/ganglia to slaves...

[timing] ganglia setup:  00h 00m 03s

Ganglia started at 
http://ec2-xxx.us-west-1.compute.amazonaws.com:5080/ganglia

$ 


bash-4.2# netstat -tulpn

Active Internet connections (only servers)

Proto Recv-Q Send-Q Local Address           Foreign Address    State    PID/Program name
tcp        0      0 0.0.0.0:8652            0.0.0.0:*          LISTEN   3832/gmetad
tcp        0      0 0.0.0.0:8787            0.0.0.0:*          LISTEN   2584/rserver
tcp        0      0 0.0.0.0:36757           0.0.0.0:*          LISTEN   2905/java
tcp        0      0 0.0.0.0:50070           0.0.0.0:*          LISTEN   2905/java
tcp        0      0 0.0.0.0:22              0.0.0.0:*          LISTEN   2144/sshd
tcp        0      0 127.0.0.1:631           0.0.0.0:*          LISTEN   2095/cupsd
tcp        0      0 127.0.0.1:7000          0.0.0.0:*          LISTEN   6512/python3.4
tcp        0      0 127.0.0.1:25            0.0.0.0:*          LISTEN   2183/sendmail
tcp        0      0 0.0.0.0:43813           0.0.0.0:*          LISTEN   3093/java
tcp        0      0 172.31.22.140:9000      0.0.0.0:*          LISTEN   2905/java
tcp        0      0 0.0.0.0:8649            0.0.0.0:*          LISTEN   3810/gmond
tcp        0      0 0.0.0.0:50090           0.0.0.0:*          LISTEN   3093/java
tcp        0      0 0.0.0.0:8651            0.0.0.0:*          LISTEN   3832/gmetad
tcp        0      0 :::8080                 :::*               LISTEN   23719/java
tcp        0      0 :::8081                 :::*               LISTEN   5588/java
tcp        0      0 :::172.31.22.140:6066   :::*               LISTEN   23719/java
tcp        0      0 :::172.31.22.140:6067   :::*               LISTEN   5588/java
tcp        0      0 :::22                   :::*               LISTEN   2144/sshd
tcp        0      0 ::1:631                 :::*               LISTEN   2095/cupsd
tcp        0      0 :::19998                :::*               LISTEN   3709/java
tcp        0      0 :::1                    :::*               LISTEN   3709/java
tcp        0      0 :::172.31.22.140:7077   :::*               LISTEN   23719/java
tcp        0      0 :::172.31.22.140:7078   :::*               LISTEN   5588/java
udp        0      0 0.0.0.0:8649            0.0.0.0:*                   3810/gmond
udp        0      0 0.0.0.0:631             0.0.0.0:*                   2095/cupsd
udp        0      0 0.0.0.0:38546           0.0.0.0:*                   2905/java
udp        0      0 0.0.0.0:68              0.0.0.0:*                   1142/dhclient
udp        0      0 172.31.22.140:123       0.0.0.0:*                   2168/ntpd
udp        0      0 127.0.0.1:123           0.0.0.0:*                   2168/ntpd
udp        0      0 0.0.0.0:123             0.0.0.0:*                   2168/ntpd




Re: Spark Streaming - Direct Approach

2016-07-11 Thread Tathagata Das
Aah, the docs have not been updated. They are totally in production in many
places. Others should chime in as well.

On Mon, Jul 11, 2016 at 1:43 PM, Mail.com  wrote:

> Hi All,
>
> Can someone please confirm if streaming direct approach for reading Kafka
> is still experimental or can it be used for production use.
>
> I see the documentation and talk from TD suggesting the advantages of the
> approach but docs state it is an "experimental" feature.
>
> Please suggest
>
> Thanks,
> Pradeep
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark Streaming - Direct Approach

2016-07-11 Thread Andy Davidson
Hi Pradeep

I cannot comment on experimental vs. production, however I recently
started a POC using the direct approach.

It's been running off and on for about 2 weeks. In general it seems to work
really well. One thing that is not clear to me is how the cursor is managed.
E.g. I have my topic set to delete after 1 hr. I have had some problems that
caused huge delays, i.e. jobs were running after the data was deleted. After
lots of jobs fail it seems to recover, i.e. the cursor advances to a valid
position.


I am running into problems, however I do not think they have to do with the
direct approach. I think they have to do with writing to S3.

Andy

From:  "Mail.com" 
Date:  Monday, July 11, 2016 at 1:43 PM
To:  "user @spark" 
Subject:  Spark Streaming - Direct Approach

> Hi All,
> 
> Can someone please confirm if streaming direct approach for reading Kafka is
> still experimental or can it be used for production use.
> 
> I see the documentation and talk from TD suggesting the advantages of the
> approach but docs state it is an "experimental" feature.
> 
> Please suggest
> 
> Thanks,
> Pradeep
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 




Error starting thrift server on Spark

2016-07-11 Thread Marco Colombo
Hi all, I cannot start thrift server on spark 1.6.2
I've configured binding port and IP and left default metastore.
In logs I get:

16/07/11 22:51:46 INFO NettyBlockTransferService: Server created on 46717
16/07/11 22:51:46 INFO BlockManagerMaster: Trying to register BlockManager
16/07/11 22:51:46 INFO BlockManagerMasterEndpoint: Registering block
manager 10.0.2.15:46717 with 511.1 MB RAM, BlockManagerId(driver,
10.0.2.15, 46717)
16/07/11 22:51:46 INFO BlockManagerMaster: Registered BlockManager
16/07/11 22:51:46 INFO AppClient$ClientEndpoint: Executor updated:
app-20160711225146-/0 is now RUNNING
16/07/11 22:51:47 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:76)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/07/11 22:51:47 INFO SparkUI: Stopped Spark web UI at http://localdev:4040

Did anyone found a similar issue? Any suggestion on the root cause?

Thanks to all!


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
In my test I did like for like, keeping the setup the same, namely:


   1. Table was a parquet table of 100 Million rows
   2. The same set up was used for both Hive on Spark and Hive on MR
   3. Spark was very impressive compared to MR on this particular test.


Just to see any issues, I created an ORC table in the image of the Parquet one
(insert/select from Parquet to ORC) with stats updated for columns etc.

These were the results of the same run using ORC table this time:

hive> select max(id) from oraclehadoop.dummy;

Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
Finished
Status: Finished successfully in 16.08 seconds
OK
1
Time taken: 17.775 seconds, Fetched: 1 row(s)

Repeat with MR engine

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1468226887011_0008, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0008/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0008
Hadoop job information for Stage-1: number of mappers: 23; number of
reducers: 1
2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
16.48 sec
2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
40.63 sec
2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
58.88 sec
2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
80.72 sec
2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU
103.43 sec
2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU
125.93 sec
2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU
147.17 sec
2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU
166.56 sec
2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU
189.29 sec
2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU
211.03 sec
2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU
235.68 sec
2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU
258.27 sec
2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU
278.44 sec
2016-07-11 21:38:49,206 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU
300.35 sec
2016-07-11 21:38:58,524 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU
322.89 sec
2016-07-11 21:39:07,889 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU
344.8 sec
2016-07-11 21:39:16,151 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU
367.77 sec
2016-07-11 21:39:25,456 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU
391.82 sec
2016-07-11 21:39:33,725 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
415.48 sec
2016-07-11 21:39:43,037 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU
436.09 sec
2016-07-11 21:39:51,292 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU
459.4 sec
2016-07-11 21:39:59,563 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
477.92 sec
2016-07-11 21:40:05,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
491.72 sec
2016-07-11 21:40:10,921 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
499.37 sec
MapReduce Total cumulative CPU time: 8 minutes 19 seconds 370 msec
Ended Job = 

Spark Streaming - Direct Approach

2016-07-11 Thread Mail.com
Hi All,

Can someone please confirm if streaming direct approach for reading Kafka is 
still experimental or can it be used for production use.

I see the documentation and talk from TD suggesting the advantages of the 
approach but docs state it is an "experimental" feature. 

Please suggest

Thanks,
Pradeep

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Processing ion formatted messages in spark

2016-07-11 Thread pandees waran
All,

Did anyone ever work on processing Ion-formatted messages in Spark? The Ion
format is a superset of JSON: all JSON documents are valid Ion, but the reverse
is not true.

For more details on Ion;
http://amznlabs.github.io/ion-docs/

Thanks.


Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Pedro Rodriguez
Is it possible with Spark SQL to merge columns whose types are Arrays or
Sets?

My use case would be something like this:

DF types
id: String
words: Array[String]

I would want to do something like

df.groupBy('id).agg(merge_arrays('words)) -> list of all words
df.groupBy('id).agg(merge_sets('words)) -> list of distinct words

Thanks,
-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: question about UDAF

2016-07-11 Thread Pedro Rodriguez
I am not sure I understand your code entirely, but I worked with UDAFs
Friday and over the weekend (
https://gist.github.com/EntilZha/3951769a011389fef25e930258c20a2a).

I think what is going on is that your "update" function is not defined
correctly. Update should take a possibly initialized or in progress buffer
and integrate new results in. Right now, you ignore the input row. What is
probably occurring is that the initialization value "" is setting the
buffer equal to the buffer itself which is "".

Merge is responsible for taking two buffers and merging them together. In
this case, the buffers are "" since initialize makes it "" and update keeps
it "" so the result is just "". I am not sure it matters, but you probably
also want to do buffer.getString(0).
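
For illustration, a sketch of a corrected version (the class name is arbitrary;
it assumes the goal is simply to concatenate the incoming Int ids into one String,
and a delimiter or an ArrayType buffer would usually be nicer in practice):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ConcatIds extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("id", IntegerType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("ids", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = "" }

  // Fold the incoming row into the buffer instead of overwriting it with itself.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getString(0) + input.getInt(0).toString
  }

  // Combine two partial buffers; both hold Strings, so no getAs[Int] here.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getString(0) + buffer2.getString(0)
  }

  def evaluate(buffer: Row): Any = buffer.getString(0)
}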

Pedro

On Mon, Jul 11, 2016 at 3:04 AM,  wrote:

> hello guys:
>  I have a DF and a UDAF. this DF has 2 columns, lp_location_id , id,
> both are of Int type. I want to group by id and aggregate all value of id
> into 1 string. So I used a UDAF to do this transformation: multi Int values
> to 1 String. However my UDAF returns empty values as the accessory attached.
>  Here is my code for my main class:
> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
> val hiveTable = hc.sql("select lp_location_id,id from
> house_id_pv_location_top50")
>
> val jsonArray = new JsonArray
> val result =
> hiveTable.groupBy("lp_location_id").agg(jsonArray(col("id")).as("jsonArray")).collect.foreach(println)
>
> --
>  Here is my code of my UDAF:
>
> class JsonArray extends UserDefinedAggregateFunction {
>   def inputSchema: org.apache.spark.sql.types.StructType =
> StructType(StructField("id", IntegerType) :: Nil)
>
>   def bufferSchema: StructType = StructType(
> StructField("id", StringType) :: Nil)
>
>   def dataType: DataType = StringType
>
>   def deterministic: Boolean = true
>
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = ""
>   }
>
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> buffer(0) = buffer.getAs[Int](0)
>   }
>
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> val s1 = buffer1.getAs[Int](0).toString()
> val s2 = buffer2.getAs[Int](0).toString()
> buffer1(0) = s1.concat(s2)
>   }
>
>   def evaluate(buffer: Row): Any = {
> buffer(0)
>   }
> }
>
>
> I don't quite understand why I get an empty result from my UDAF. I guess there
> may be 2 reasons:
> 1. incorrect initialization with "" in the initialize method
> 2. the buffer didn't get written to successfully.
>
> Can anyone share an idea about this? Thank you.
>
>
>
>
> 
>
> Thanks! Best regards!
> San.Luo
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


trouble accessing driver log files using rest-api

2016-07-11 Thread Andy Davidson
I am running spark-1.6.1 and the standalone cluster manager. I am running
into performance problems with Spark Streaming and added some extra metrics
to my log files. I submit my app in cluster mode (i.e. the driver runs on a
slave, not the master).


I am not able to get the driver log files while the app is running using the
documented rest api
 
http://spark.apache.org/docs/latest/monitoring.html#rest-api

I think the issue is that the rest-api gives you access to the app log files; I
need the driver log file.


$ curl  http://$host/api/v1/applications/

[ {

  "id" : "app-20160711185337-0049",

  "name" : "gnip1",

  "attempts" : [ {

"startTime" : "2016-07-11T18:53:35.318GMT",

"endTime" : "1969-12-31T23:59:59.999GMT",

"sparkUser" : "",

"completed" : false

  } ]

} ][ec2-user@ip-172-31-22-140 tmp]$



$ curl -o$outputFile http://$host/api/v1/applications/$appID/logs



$outputFile will always be an empty zip file



If I use executors/, I get info about the driver and executors, however there is
no way to 'get' the log files. The driver output does not have any executorLogs,
and the workers' executorLogs are versions of the log files rendered in HTML,
not the actual log files.




$ curl http://$host/api/v1/applications/$appID/executors [ { "id" :
"driver", "hostPort" : "172.31.23.203:33303", "rddBlocks" : 0, "memoryUsed"
: 0, "diskUsed" : 0, "activeTasks" : 0, "failedTasks" : 0, "completedTasks"
: 0, "totalTasks" : 0, "totalDuration" : 0, "totalInputBytes" : 0,
"totalShuffleRead" : 0, "totalShuffleWrite" : 0, "maxMemory" : 535953408,
"executorLogs" : { } }, { "id" : "1", "hostPort" :
"ip-172-31-23-200.us-west-1.compute.internal:51560", "rddBlocks" : 218,
"memoryUsed" : 452224280, "diskUsed" : 0, "activeTasks" : 1, "failedTasks" :
0, "completedTasks" : 27756, "totalTasks" : 27757, "totalDuration" :
1650935, "totalInputBytes" : 9619224986, "totalShuffleRead" : 0,
"totalShuffleWrite" : 507615, "maxMemory" : 535953408, "executorLogs" : {
"stdout" : 
"http://ec2-xxx.compute.amazonaws.com:8081/logPage/?appId=app-20160711185337
-0049=1=stdout", "stderr" :
"http://ec2-xxx.us-west-1.compute.amazonaws.com:8081/logPage/?appId=app-2016
0711185337-0049=1=stderr" }

Any suggestions would be greatly appreciated

Andy





Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
Thanks Michael!

But what about when I am not trying to save as parquet? No way around the
error using saveAsTable()? I am using Spark 1.4.

Tobi
On Jul 11, 2016 2:10 PM, "Michael Armbrust"  wrote:

> This is protecting you from a limitation in parquet.  The library will let
> you write out invalid files that can't be read back, so we added this check.
>
> You can call .format("csv") (in spark 2.0) to switch it to CSV.
>
> On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede  wrote:
>
>> Hi everyone,
>>
>> I am trying to save a data frame with special characters in the column
>> names as a table in hive. However I am getting the following error. Is the
>> only solution to rename all the columns? Or is there some argument that can
>> be passed into the saveAsTable() or write.parquet() functions to ignore
>> special characters?
>>
>> Py4JJavaError: An error occurred while calling o2956.saveAsTable.
>> : org.apache.spark.sql.AnalysisException: Attribute name "apple- 
>> mail_duration" contains invalid character(s) among " ,;{}()\n\t=". Please 
>> use alias to rename it.
>>
>>
>> If not how can I simply write the data frame as a csv file?
>>
>> Thanks,
>> Tobi
>>
>>
>>
>


Re: Custom Spark Error on Hadoop Cluster

2016-07-11 Thread Xiangrui Meng
(+user@spark. Please copy user@ so other people could see and help.)

The error message means you have an MLlib jar on the classpath but it
didn't contain ALS$StandardNNLSSolver. So either the modified jar was not
deployed to the workers, or an unmodified MLlib jar is sitting
in front of the modified one on the classpath. You can check the worker
logs and see the classpath used in launching the worker, and then check the
MLlib jars on that classpath. -Xiangrui

On Sun, Jul 10, 2016 at 10:18 PM Alger Remirata 
wrote:

> Hi Xiangrui,
>
> We have the modified jars deployed both on master and slave nodes.
>
> What do you mean by this line?: 1. The unmodified Spark jars were not on
> the classpath (already existed on the cluster or pulled in by other
> packages).
>
> How would I check that the unmodified Spark jars are not on the classpath?
> We change entirely the contents of the directory for SPARK_HOME. The newly
> built customized spark is the new contents of the current SPARK_HOME we
> have right now.
>
> Thanks,
>
> Alger
>
> On Fri, Jul 8, 2016 at 1:32 PM, Xiangrui Meng  wrote:
>
>> This seems like a deployment or dependency issue. Please check the
>> following:
>> 1. The unmodified Spark jars were not on the classpath (already existed
>> on the cluster or pulled in by other packages).
>> 2. The modified jars were indeed deployed to both master and slave nodes.
>>
>> On Tue, Jul 5, 2016 at 12:29 PM Alger Remirata 
>> wrote:
>>
>>> Hi all,
>>>
>>> First of all, we like to thank you for developing spark. This helps us a
>>> lot on our data science task.
>>>
>>> I have a question. We have build a customized spark using the following
>>> command:
>>> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
>>> -Phive-thriftserver -DskipTests clean package.
>>>
>>> On the custom spark we built, we've added a new scala file or package
>>> called StandardNNLS file however it got an error saying:
>>>
>>> Name: org.apache.spark.SparkException
>>> Message: Job aborted due to stage failure: Task 21 in stage 34.0 failed
>>> 4 times, most recent failure: Lost task 21.3 in stage 34.0 (TID 2547,
>>> 192.168.60.115): java.lang.ClassNotFoundException:
>>> org.apache.spark.ml.recommendation.ALS$StandardNNLSSolver
>>>
>>> StandardNNLSolver is found on another scala file called
>>> StandardNNLS.scala
>>> as we replace the original NNLS solver from scala with StandardNNLS
>>> Do you guys have some idea about the error. Is there a config file we
>>> need to edit to add the classpath? Even if we insert the added codes in
>>> ALS.scala and not create another file like StandardNNLS.scala, the inserted
>>> code is not recognized. It still gets an error regarding
>>> ClassNotFoundException
>>>
>>> However, when we run this on our local machine and not on the hadoop
>>> cluster, it is working. We don't know if the error is because we are using
>>> mvn to build custom spark or it has something to do with communicating to
>>> hadoop cluster.
>>>
>>> We would like to ask some ideas from you how to solve this problem. We
>>> can actually create another package not dependent to Apache Spark but this
>>> is so slow. As of now, we are still learning scala and spark. Using Apache
>>> spark utilities make the code faster. However, if we'll make another
>>> package not dependent to apache spark, we have to recode the utilities that
>>> are set private in Apache Spark. So, it is better to use Apache Spark and
>>> insert some code that we can use.
>>>
>>> Thanks,
>>>
>>> Alger
>>>
>>
>


Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Michael Armbrust
This is protecting you from a limitation in parquet.  The library will let
you write out invalid files that can't be read back, so we added this check.

You can call .format("csv") (in spark 2.0) to switch it to CSV.
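
If staying on parquet/saveAsTable in 1.x, one workaround sketch (my own, not an
official option) is to alias every column to a parquet-safe name first; the
invalid characters are the " ,;{}()\n\t=" set from the error message, and the
table name below is a placeholder:

// Replace any invalid character in a column name with '_' before saving.
val invalid = " ,;{}()\n\t="
val safe = df.columns.foldLeft(df) { (d, c) =>
  val cleaned = c.map(ch => if (invalid.contains(ch)) '_' else ch)
  if (cleaned == c) d else d.withColumnRenamed(c, cleaned)
}
safe.write.saveAsTable("my_table")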

On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede  wrote:

> Hi everyone,
>
> I am trying to save a data frame with special characters in the column
> names as a table in hive. However I am getting the following error. Is the
> only solution to rename all the columns? Or is there some argument that can
> be passed into the saveAsTable() or write.parquet() functions to ignore
> special characters?
>
> Py4JJavaError: An error occurred while calling o2956.saveAsTable.
> : org.apache.spark.sql.AnalysisException: Attribute name "apple- 
> mail_duration" contains invalid character(s) among " ,;{}()\n\t=". Please use 
> alias to rename it.
>
>
> If not how can I simply write the data frame as a csv file?
>
> Thanks,
> Tobi
>
>
>


Run Stored Procedures from Spark SqlContext

2016-07-11 Thread zachkirsch
Hi,

I have a SQL Server set up, and I also have a Spark cluster up and running
that is executing Scala programs. I can connect to the SQL Server and query
for data successfully. However, I need to call stored procedures from the
Scala/Spark code (stored procedures that exist in the database) and I can't
figure it out.

Can anyone help me out, or direct me to a forum that might be better suited
for this question? I can provide any more information that is helpful.
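
(For reference, a sketch with hypothetical names, and not a Spark-specific API:
a stored procedure can be invoked over plain JDBC from the driver, since the
Spark JDBC data source only reads tables/queries. It assumes the SQL Server JDBC
driver jar is on the classpath.)

import java.sql.DriverManager

val url = "jdbc:sqlserver://myhost:1433;databaseName=mydb;user=me;password=secret"
val conn = DriverManager.getConnection(url)
try {
  // JDBC escape syntax for calling a stored procedure with one parameter.
  val stmt = conn.prepareCall("{call dbo.my_stored_proc(?)}")
  stmt.setInt(1, 42) // hypothetical parameter
  stmt.execute()
} finally {
  conn.close()
}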

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Run-Stored-Procedures-from-Spark-SqlContext-tp27322.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
Hi everyone,

I am trying to save a data frame with special characters in the column
names as a table in hive. However I am getting the following error. Is the
only solution to rename all the columns? Or is there some argument that can
be passed into the saveAsTable() or write.parquet() functions to ignore
special characters?

Py4JJavaError: An error occurred while calling o2956.saveAsTable.
: org.apache.spark.sql.AnalysisException: Attribute name "apple-
mail_duration" contains invalid character(s) among " ,;{}()\n\t=".
Please use alias to rename it.


If not how can I simply write the data frame as a csv file?

Thanks,
Tobi


Re: Question on Spark shell

2016-07-11 Thread Sivakumaran S
That was my bad with the title. 

I am getting that output when I run my application, both from the IDE as well 
as in the console. 

I want the server logs themselves displayed in the terminal from where I start the
server. Right now, running the command ‘start-master.sh’ returns the prompt. I
want the Spark logs as events occur (INFO, WARN, ERROR); like enabling a debug
mode wherein server output is printed to the screen.

I have to edit the log4j properties file, that much I have learnt so far. 
Should be able to hack it now. Thanks for the help. Guess just helping to frame 
the question was enough to find the answer :)




> On 11-Jul-2016, at 6:57 PM, Anthony May  wrote:
> 
> I see. The title of your original email was "Spark Shell" which is a Spark 
> REPL environment based on the Scala Shell, hence why I misunderstood you.
> 
> You should have the same output starting the application on the console. You 
> are not seeing any output?
> 
On Mon, 11 Jul 2016 at 11:55 Sivakumaran S wrote:
> I am running a spark streaming application using Scala in the IntelliJ IDE. I 
> can see the Spark output in the IDE itself (aggregation and stuff). I want 
> the spark server logging (INFO, WARN, etc) to be displayed in screen when I 
> start the master in the console. For example, when I start a kafka cluster, 
> the prompt is not returned and the debug log is printed to the terminal. I 
> want that set up with my spark server. 
> 
> I hope that explains my retrograde requirement :)
> 
> 
> 
>> On 11-Jul-2016, at 6:49 PM, Anthony May wrote:
>> 
>> Starting the Spark Shell gives you a Spark Context to play with straight 
>> away. The output is printed to the console.
>> 
>> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S wrote:
>> Hello,
>> 
>> Is there a way to start the spark server with the log output piped to 
>> screen? I am currently running spark in the standalone mode on a single 
>> machine.
>> 
>> Regards,
>> 
>> Sivakumaran
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> 
>> 
> 



Re: Question on Spark shell

2016-07-11 Thread Anthony May
I see. The title of your original email was "Spark Shell" which is a Spark
REPL environment based on the Scala Shell, hence why I misunderstood you.

You should have the same output starting the application on the console.
You are not seeing any output?

On Mon, 11 Jul 2016 at 11:55 Sivakumaran S  wrote:

> I am running a spark streaming application using Scala in the IntelliJ
> IDE. I can see the Spark output in the IDE itself (aggregation and stuff).
> I want the spark server logging (INFO, WARN, etc) to be displayed in screen
> when I start the master in the console. For example, when I start a kafka
> cluster, the prompt is not returned and the debug log is printed to the
> terminal. I want that set up with my spark server.
>
> I hope that explains my retrograde requirement :)
>
>
>
> On 11-Jul-2016, at 6:49 PM, Anthony May  wrote:
>
> Starting the Spark Shell gives you a Spark Context to play with straight
> away. The output is printed to the console.
>
> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S  wrote:
>
>> Hello,
>>
>> Is there a way to start the spark server with the log output piped to
>> screen? I am currently running spark in the standalone mode on a single
>> machine.
>>
>> Regards,
>>
>> Sivakumaran
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Question on Spark shell

2016-07-11 Thread Sivakumaran S
I am running a spark streaming application using Scala in the IntelliJ IDE. I 
can see the Spark output in the IDE itself (aggregation and stuff). I want the 
spark server logging (INFO, WARN, etc) to be displayed in screen when I start 
the master in the console. For example, when I start a kafka cluster, the 
prompt is not returned and the debug log is printed to the terminal. I want 
that set up with my spark server. 

I hope that explains my retrograde requirement :)



> On 11-Jul-2016, at 6:49 PM, Anthony May  wrote:
> 
> Starting the Spark Shell gives you a Spark Context to play with straight 
> away. The output is printed to the console.
> 
> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S wrote:
> Hello,
> 
> Is there a way to start the spark server with the log output piped to screen? 
> I am currently running spark in the standalone mode on a single machine.
> 
> Regards,
> 
> Sivakumaran
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 



Re: Question on Spark shell

2016-07-11 Thread Anthony May
Starting the Spark Shell gives you a Spark Context to play with straight
away. The output is printed to the console.

On Mon, 11 Jul 2016 at 11:47 Sivakumaran S  wrote:

> Hello,
>
> Is there a way to start the spark server with the log output piped to
> screen? I am currently running spark in the standalone mode on a single
> machine.
>
> Regards,
>
> Sivakumaran
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Question on Spark shell

2016-07-11 Thread Sivakumaran S
Hello,

Is there a way to start the spark server with the log output piped to screen? I 
am currently running spark in the standalone mode on a single machine. 

Regards,

Sivakumaran


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Appreciate all the comments.

Hive on Spark. Spark runs as an execution engine and is only used when you
query Hive. Otherwise it is not running. I run it in Yarn client mode. let
me show you an example

In hive-site.xml set the execution engine to spark. It requires
some configuration but it does work :)

Alternatively log in to hive and do the setting there


set hive.execution.engine=spark;
set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
set spark.master=yarn-client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.cores=8;
set spark.ui.port=;

Small test ride

First using Hive 2 on Spark 1.3.1 to find max(id) for a 100million rows
parquet table

hive> select max(id) from oraclehadoop.dummy_parquet;

Starting Spark Job = a7752b2b-d73a-45de-aced-ddf02810938d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 17:41:52,386 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-11 17:41:55,409 Stage-2_0: 1(+8)/24 Stage-3_0: 0/1
2016-07-11 17:41:56,420 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-11 17:41:58,434 Stage-2_0: 10(+2)/24Stage-3_0: 0/1
2016-07-11 17:41:59,440 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-11 17:42:01,455 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-11 17:42:02,462 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-11 17:42:04,476 Stage-2_0: 23(+1)/24Stage-3_0: 0/1
2016-07-11 17:42:05,483 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
Finished

Status: Finished successfully in 14.12 seconds
OK
1
Time taken: 14.38 seconds, Fetched: 1 row(s)

--simply switch the engine in hive to MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy_parquet;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Starting Job = job_1468226887011_0005, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0005/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0005
Hadoop job information for Stage-1: number of mappers: 24; number of
reducers: 1
2016-07-11 17:42:46,904 Stage-1 map = 0%,  reduce = 0%
2016-07-11 17:42:56,328 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
31.76 sec
2016-07-11 17:43:05,676 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU
61.78 sec
2016-07-11 17:43:16,091 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
95.44 sec
2016-07-11 17:43:24,419 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
121.6 sec
2016-07-11 17:43:32,734 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU
149.37 sec
2016-07-11 17:43:41,031 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU
177.62 sec
2016-07-11 17:43:48,305 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU
204.92 sec
2016-07-11 17:43:56,580 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU
235.34 sec
2016-07-11 17:44:05,917 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU
262.18 sec
2016-07-11 17:44:14,222 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU
286.21 sec
2016-07-11 17:44:22,502 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU
310.34 sec
2016-07-11 17:44:32,923 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
346.26 sec
2016-07-11 17:44:43,301 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU
379.11 sec
2016-07-11 17:44:53,674 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU
417.9 sec
2016-07-11 17:45:04,001 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU
450.73 sec
2016-07-11 17:45:13,327 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU
476.7 sec
2016-07-11 17:45:22,656 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU
508.97 sec
2016-07-11 17:45:33,002 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU
535.69 sec
2016-07-11 17:45:43,355 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU
573.33 sec
2016-07-11 17:45:52,613 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
605.01 sec
2016-07-11 17:46:02,962 Stage-1 map = 88%,  reduce = 0%, Cumulative CPU
632.38 sec
2016-07-11 17:46:13,316 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU
666.45 sec
2016-07-11 17:46:23,656 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
693.72 sec
2016-07-11 17:46:31,919 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
714.71 sec
2016-07-11 17:46:36,060 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
721.83 sec
MapReduce Total cumulative CPU time: 12 minutes 1 seconds 830 msec
Ended Job = job_1468226887011_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 24  Reduce: 1   Cumulative CPU: 721.83 sec   HDFS Read:
400442823 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 12 minutes 1 seconds 830 msec
OK
1
Time taken: 239.532 seconds, Fetched: 1 row(s)


I leave it 

Using accumulators in Local mode for testing

2016-07-11 Thread harelglik
Hi,

I am writing an app in Spark ( 1.6.1 ) in which I am using an accumulator.
My accumulator is simply counting rows: acc += 1.
My test processes 4 files each with 4 rows however the value of the
accumulator in the end is not 16 and even worse is inconsistent between
runs.

Are accumulators not to be used in LocalMode?, Is it a known issue?

Many thanks,
Harel.
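
For what it's worth, accumulators do work in local mode. The usual cause of inconsistent totals is incrementing the accumulator inside a transformation (map, filter, ...): Spark only guarantees that each task's updates are applied exactly once when the updates happen inside an action, and transformation-side updates can be applied more than once if tasks or stages are re-executed (and an action like take() may not compute every partition at all). A minimal sketch of the safe pattern (the input path is hypothetical):

val acc = sc.accumulator(0L, "rows")   // Spark 1.6 accumulator API
sc.textFile("/data/in/*.txt")          // hypothetical input path
  .foreach(_ => acc += 1L)             // foreach is an action, so each row is counted once
println(s"rows seen: ${acc.value}")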



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-accumulators-in-Local-mode-for-testing-tp27321.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: KEYS file?

2016-07-11 Thread Sean Owen
Yeah the canonical place for a project's KEYS file for ASF projects is

http://www.apache.org/dist/{project}/KEYS

and so you can indeed find this key among:

http://www.apache.org/dist/spark/KEYS

I'll put a link to this info on the downloads page because it is important info.
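
For anyone following along, verification then looks roughly like this (the artifact name is just an example from the 1.6.2 binary downloads):

wget http://www.apache.org/dist/spark/KEYS
gpg --import KEYS
gpg --verify spark-1.6.2-bin-hadoop2.6.tgz.asc spark-1.6.2-bin-hadoop2.6.tgz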

On Mon, Jul 11, 2016 at 4:48 AM, Shuai Lin  wrote:
>> at least links to the keys used to sign releases on the
>> download page
>
>
> +1 for that.
>
> On Mon, Jul 11, 2016 at 3:35 AM, Phil Steitz  wrote:
>>
>> On 7/10/16 10:57 AM, Shuai Lin wrote:
>> > Not sure where you see " 0x7C6C105FFC8ED089". I
>>
>> That's the key ID for the key below.
>> > think the release is signed with the
>> > key https://people.apache.org/keys/committer/pwendell.asc .
>>
>> Thanks!  That key matches.  The project should publish a KEYS file
>> [1] or at least links to the keys used to sign releases on the
>> download page.  Could be there is one somewhere and I just can't
>> find it.
>>
>> Phil
>>
>> [1] http://www.apache.org/dev/release-signing.html#keys-policy
>> >
>> > I think this tutorial can be
>> > helpful: http://www.apache.org/info/verification.html
>> >
>> > On Mon, Jul 11, 2016 at 12:57 AM, Phil Steitz
>> > > wrote:
>> >
>> > I can't seem to find a link the the Spark KEYS file.  I am
>> > trying to
>> > validate the sigs on the 1.6.2 release artifacts and I need to
>> > import 0x7C6C105FFC8ED089.  Is there a KEYS file available for
>> > download somewhere?  Apologies if I am just missing an obvious
>> > link.
>> >
>> > Phil
>> >
>> >
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> > 
>> >
>> >
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



What is the maximum number of column being supported by apache spark dataframe

2016-07-11 Thread Zijing Guo
Hi all,

Spark version: 1.5.2 with YARN 2.7.1.2.3.0.0-2557. I'm running into a problem
while I'm exploring the data through spark-shell: I'm trying to create a really
fat dataframe with 3000 columns. Code as below:

val valueFunctionUDF = udf((valMap: Map[String, String], dataItemId: String) =>
  valMap.get(dataItemId) match {
  case Some(v) => v.toDouble
  case None => Double.NaN
})

s1 is the main dataframe and its schema is as below:

|-- combKey: string (nullable = true)
|-- valMaps: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)

After I run the code:

dataItemIdVals.foreach{w =>
 s1 = s1.withColumn(w, valueFunctionUDF($"valMaps", $"combKey"))}

my terminal just gets stuck after the above code, with the following info being
printed out:

16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
172.22.49.20:41494 in memory (size: 7.6 KB, free: 5.2 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:43026 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:44890 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:52020 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:33272 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:48481 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:44026 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:34539 in memory (size: 7.6 KB, free: 5.0 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:43734 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:42769 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:60603 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:59102 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:47578 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:43149 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:52488 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
x:52298 in memory (size: 7.6 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 9
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
172.22.49.20:41494 in memory (size: 7.3 KB, free: 5.2 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:33272 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:59102 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:44026 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:42769 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:43149 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:43026 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:52298 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:42890 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:47578 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:60603 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:43734 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:48481 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:52020 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:52488 in memory (size: 7.3 KB, free: 5.1 GB)
 16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
x:34539 in memory (size: 7.3 KB, free: 5.0 GB)
 16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 8
 16/07/11 12:20:54 INFO ContextCleaner: Cleaned shuffle 0
 16/07/11 12:20:54 INFO ContextCleaner: Cleaned 
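
Not an answer to the hang itself, but worth noting: each withColumn call above wraps the plan in another projection, so 3000 chained calls are likely to make plan analysis very expensive. A sketch of building all the columns in a single select instead, reusing the same s1, valueFunctionUDF and dataItemIdVals from the original code (and assuming the usual sqlContext.implicits._ import for the $ syntax):

import org.apache.spark.sql.functions.col

// one Column per new name, then a single select over the old and new columns
val newCols = dataItemIdVals.map(w => valueFunctionUDF($"valMaps", $"combKey").as(w))
val wide = s1.select((s1.columns.map(col) ++ newCols): _*)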

Marking files as read in Spark Streaming

2016-07-11 Thread soumick dasgupta
Hi,

I am looking for a solution in Spark Streaming where I can mark the files
that I have already read in HDFS. This is to make sure that I am not
reading the same file by mistake and also to ensure that I have read all
the records in a given file.

Thank You,

Soumick
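
One low-tech approach (not a Spark feature, just a sketch with hypothetical paths) is to move each file into a "processed" directory once the batch containing it has completed, using the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// after the batch containing this file has been processed successfully:
fs.rename(new Path("/data/incoming/file1.json"), new Path("/data/processed/file1.json"))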


Re: Cluster mode deployment from jar in S3

2016-07-11 Thread Steve Loughran
the fact you are using s3:// URLs means that you are using EMR and its S3 
binding lib. Which means you are probably going to have to talk to the AWS team 
there. Though I'm surprised to see a jets3t stack trace there, as the AWS s3: 
client uses the Amazon SDKs.

S3n and s3a don't currently support IAM Auth, which is what's generating the 
warning. The code in question is actually hadoop-aws.JAR, not the spark team's 
direct code, and is fixed in Hadoop 2.8 ( see: 
HADOOP-12723)


On 4 Jul 2016, at 11:30, Ashic Mahtab wrote:

Hi Lohith,
Thanks for the response.

The S3 bucket does have access restrictions, but the instances in which the 
Spark master and workers run have an IAM role policy that allows them access to 
it. As such, we don't really configure the cli with credentials...the IAM roles 
take care of that. Is there a way to make Spark work the same way? Or should I 
get temporary credentials somehow (like 
http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
 ), and use them to somehow submit the job? I guess I'll have to set it via 
environment variables; I can't put it in application code, as the issue is in 
downloading the jar from S3.

-Ashic.


From: lohith.sam...@mphasis.com
To: as...@live.com; 
user@spark.apache.org
Subject: RE: Cluster mode deployment from jar in S3
Date: Mon, 4 Jul 2016 09:50:50 +

Hi,
The aws CLI already has your access key id and secret access 
key when you initially configured it.
Is your s3 bucket without any access restrictions?





Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga





From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04, 2016 15.06
To: Apache Spark
Subject: RE: Cluster mode deployment from jar in S3



Sorry to do this...but... *bump*




From: as...@live.com
To: user@spark.apache.org
Subject: Cluster mode deployment from jar in S3
Date: Fri, 1 Jul 2016 17:45:12 +0100
Hello,
I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs 
using "--deploy-mode client", however using "--deploy-mode cluster" is proving 
to be a challenge. I've tried this:



spark-submit --class foo --master spark:://master-ip:7077 --deploy-mode cluster 
s3://bucket/dir/foo.jar



When I do this, I get:
16/07/01 16:23:16 ERROR ClientEndpoint: Exception from cluster was: 
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
(respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
(respectively).
at 
org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at 
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)





Now I'm not using any S3 or hadoop stuff within my code (it's just an 
sc.parallelize(1 to 100)). So, I imagine it's the driver trying to fetch the 
jar. I haven't set the AWS Access Key Id and Secret as mentioned, but the role 
the machines are in allows them to copy the jar. In other words, this works:



aws s3 cp s3://bucket/dir/foo.jar /tmp/foo.jar



I'm using Spark 1.6.2, and can't really think of what I can do so that I can 
submit the jar from s3 using cluster deploy mode. I've also tried simply 
downloading the jar onto a node, and spark-submitting that... that works in 
client mode, but I get a not found error when using cluster mode.



Any help will be appreciated.



Thanks,
Ashic.


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. 

Tez is ‘vendor’ independent.  ;-) 

Yeah… I know…  Anyone can support it.  Only Hortonworks has stacked the deck in 
their favor. 

Drill could be in the same boat, although there now more committers who are not 
working for MapR. I’m not sure who outside of HW is supporting Tez. 

But I digress. 

Here in the Spark user list, I have to ask how do you run hive on spark? Is the 
execution engine … the spark context always running? (Client mode I assume) 
Are the executors always running?   Can you run multiple queries from multiple 
users in parallel? 

These are some of the questions that should be asked and answered when 
considering how viable spark is going to be as the engine under Hive… 

Thx

-Mike

> On May 29, 2016, at 3:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to Hive user group to see anyone has managed to built a 
> vendor independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>> did not make enough efforts) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This is by no means indicate that Spark is much better than MR but shows 
>>> that some very good results can ve achieved using Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the write database for purpose or one is better off with 
>>> something like Phoenix on Hbase, well the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what wants is to use the fastest  method to get the results. How 
>>> fast we confine it to our SLA agreements in production and that helps us 
>>> from unnecessary further work as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, 

WARN FileOutputCommitter: Failed to delete the temporary output directory of task: attempt_201607111453_128606_m_000000_0 - s3n://

2016-07-11 Thread Andy Davidson
I am running into serious performance problems with my spark 1.6 streaming
app. As it runs it gets slower and slower.

My app is simple. 

* It receives fairly large and complex JSON files. (twitter data)
* Converts the RDD to DataFrame
* Splits the data frame in to maybe 20 different data sets
* Writes each data set as JSON to s3
* Writing to S3 is really slow. I use an executorService to get the writes
to run in parallel

I found a lot of error log messages like the following error in my spark
streaming executor log files

Any suggestions?

Thanks

Andy

16/07/11 14:53:49 WARN FileOutputCommitter: Failed to delete the temporary
output directory of task: attempt_201607111453_128606_m_00_0 -
s3n://com.xxx/json/yyy/2016-07-11/146824482/_temporary/_attempt_20160711
1453_128606_m_00_0




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
I think llap should be in the future a general component so llap + spark can 
make sense. I see tez and spark not as competitors but they have different 
purposes. Hive+Tez+llap is not the same as hive+spark. I think it goes beyond 
that for interactive queries .
Tez - you should use a distribution (eg Hortonworks) - generally I would use a 
distribution for anything related to performance , testing etc. because doing 
an own installation is more complex and more difficult to maintain. Performance 
and also features will be less good if you do not use a distribution. Which one 
is up to your choice.

> On 11 Jul 2016, at 17:09, Mich Talebzadeh  wrote:
> 
> The presentation will go deeper into the topic. Otherwise some thoughts  of 
> mine. Fell free to comment. criticise :) 
> 
> I am a member of Spark Hive and Tez user groups plus one or two others
> Spark is by far the biggest in terms of community interaction
> Tez, typically one thread in a month
> Personally started building Tez for Hive from Tez source and gave up as it 
> was not working. This was my own build as opposed to a distro
> if Hive says you should use Spark or Tez then using Spark is a perfectly 
> valid choice
> If Tez & LLAP offers you a Spark (DAG + in-memory caching) under the bonnet 
> why bother.
> Yes I have seen some test results (Hive on Spark vs Hive on Tez) etc. but 
> they are a bit dated (not being unkind) and cannot be taken as is today. One 
> their concern if I recall was excessive CPU and memory usage of Spark but 
> then with the same token LLAP will add additional need for resources
> Essentially I am more comfortable to use less of technology stack than more.  
> With Hive and Spark (in this context) we have two. With Hive, Tez and LLAP, 
> we have three stacks to look after that add to skill cost as well.
> Yep. It is still good to keep it simple
> 
> My thoughts on this are that if you have a viable open source product like 
> Spark which is becoming a sort of Vogue in Big Data space and moving very 
> fast, why look for another one. Hive does what it says on the Tin and good 
> reliable Data Warehouse.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 11 July 2016 at 15:22, Ashok Kumar  wrote:
>> Hi Mich,
>> 
>> Your recent presentation in London on this topic "Running Spark on Hive or 
>> Hive on Spark"
>> 
>> Have you made any more interesting findings that you like to bring up?
>> 
>> If Hive is offering both Spark and Tez in addition to MR, what stopping one 
>> not to use Spark? I still don't get why TEZ + LLAP is going to be a better 
>> choice from what you mentioned?
>> 
>> thanking you 
>> 
>> 
>> 
>> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
>> wrote:
>> 
>> 
>> Couple of points if I may and kindly bear with my remarks.
>> 
>> Whilst it will be very interesting to try TEZ with LLAP. As I read from LLAP
>> 
>> "Sub-second queries require fast query execution and low setup cost. The 
>> challenge for Hive is to achieve this without giving up on the scale and 
>> flexibility that users depend on. This requires a new approach using a 
>> hybrid engine that leverages Tez and something new called  LLAP (Live Long 
>> and Process, #llap online).
>> 
>> LLAP is an optional daemon process running on multiple nodes, that provides 
>> the following:
>> Caching and data reuse across queries with compressed columnar data 
>> in-memory (off-heap)
>> Multi-threaded execution including reads with predicate pushdown and hash 
>> joins
>> High throughput IO using Async IO Elevator with dedicated thread and core 
>> per disk
>> Granular column level security across applications
>> "
>> OK so we have added an in-memory capability to TEZ by way of LLAP, In other 
>> words what Spark does already and BTW it does not require a daemon running 
>> on any host. Don't take me wrong. It is interesting but this sounds to me 
>> (without testing myself) adding caching capability to TEZ to bring it on par 
>> with SPARK.
>> 
>> Remember:
>> 
>> Spark -> DAG + in-memory caching
>> TEZ = MR on DAG
>> TEZ + LLAP => DAG + in-memory caching
>> 
>> OK it is another way getting the same result. However, my concerns:
>> 
>> Spark has a wide user base. I judge this from Spark user group traffic
>> TEZ user group has no traffic I am afraid
>> LLAP I don't know
>> Sounds like Hortonworks promote TEZ and Cloudera does not want to know 
>> anything about Hive. and they promote Impala but that 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
I don’t think that it would be a good comparison. 

If memory serves, Tez w LLAP is going to be running a separate engine that is 
constantly running, no? 

Spark?  That runs under hive… 

Unless you’re suggesting that the spark context is constantly running as part 
of the hiveserver2? 

> On May 23, 2016, at 6:51 PM, Jörn Franke  wrote:
> 
> 
> Hi Mich,
> 
> I think these comparisons are useful. One interesting aspect could be 
> hardware scalability in this context. Additionally different type of 
> computations. Furthermore, one could compare Spark and Tez+llap as execution 
> engines. I have the gut feeling that  each one can be justified by different 
> use cases.
> Nevertheless, there should be always a disclaimer for such comparisons, 
> because Spark and Hive are not good for a lot of concurrent lookups of single 
> rows. They are not good for frequently write small amounts of data (eg sensor 
> data). Here hbase could be more interesting. Other use cases can justify 
> graph databases, such as Titan, or text analytics/ data matching using Solr 
> on Hadoop.
> Finally, even if you have a lot of data you need to think if you always have 
> to process everything. For instance, I have found valid use cases in practice 
> where we decided to evaluate 10 machine learning models in parallel on only a 
> sample of data and only evaluate the "winning" model of the total of data.
> 
> As always it depends :) 
> 
> Best regards
> 
> P.s.: at least Hortonworks has in their distribution spark 1.5 with hive 1.2 
> and spark 1.6 with hive 1.2. Maybe they have somewhere described how to 
> manage bringing both together. You may check also Apache Bigtop (vendor 
> neutral distribution) on how they managed to bring both together.
> 
> On 23 May 2016, at 01:42, Mich Talebzadeh  > wrote:
> 
>> Hi,
>>  
>> I have done a number of extensive tests using Spark-shell with Hive DB and 
>> ORC tables.
>>  
>> Now one issue that we typically face is and I quote:
>>  
>> Spark is fast as it uses Memory and DAG. Great but when we save data it is 
>> not fast enough
>> 
>> OK but there is a solution now. If you use Spark with Hive and you are on a 
>> descent version of Hive >= 0.14, then you can also deploy Spark as execution 
>> engine for Hive. That will make your application run pretty fast as you no 
>> longer rely on the old Map-Reduce for Hive engine. In a nutshell what you 
>> are gaining speed in both querying and storage.
>>  
>> I have made some comparisons on this set-up and I am sure some of you will 
>> find it useful.
>>  
>> The version of Spark I use for Spark queries (Spark as query tool) is 1.6.
>> The version of Hive I use in Hive 2
>> The version of Spark I use as Hive execution engine is 1.3.1 It works and 
>> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out 
>> the Hadoop libraries mismatch).
>>  
>> An example I am using Hive on Spark engine to find the min and max of IDs 
>> for a table with 1 billion rows:
>>  
>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), 
>> stddev(id) from oraclehadoop.dummy;
>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>  
>>  
>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>  
>> INFO  : Completed compiling 
>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); 
>> Time taken: 1.911 seconds
>> INFO  : Executing 
>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): 
>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> INFO  : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>> INFO  : Total jobs = 1
>> INFO  : Launching Job 1 out of 1
>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>  
>> Query Hive on Spark job[0] stages:
>> 0
>> 1
>> Status: Running (Hive on Spark job[0])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId: 
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>> [StageCost]
>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>> INFO  :
>> Query Hive on Spark job[0] stages:
>> INFO  : 0
>> INFO  : 1
>> INFO  :
>> Status: Running (Hive on Spark job[0])
>> INFO  : Job Progress Format
>> CurrentTime StageId_StageAttemptId: 
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>> [StageCost]
>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished 

spark UI what does storage memory x/y mean

2016-07-11 Thread Andy Davidson
My stream app is running into problems. It seems to slow down over time. How
can I interpret the storage memory column? I wonder if I have a GC problem.
Any idea how I can get GC stats?

Thanks

Andy

Executors (3)
* Memory: 9.4 GB Used (1533.4 MB Total)
* Disk: 0.0 B Used
Executor ID | Address                                           | RDD Blocks | Storage Memory    | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input   | Shuffle Read | Shuffle Write
0           | ip-172-31-23-202.us-west-1.compute.internal:52456 | 2860       | 4.7 GB / 511.1 MB | 0.0 B     | 0            | 401          | 349579         | 349980      | 5.37 h    | 72.9 GB | 84.0 B       | 5.9 MB
1           | ip-172-31-23-200.us-west-1.compute.internal:51609 | 2854       | 4.6 GB / 511.1 MB | 0.0 B     | 0            | 411          | 349365         | 349776      | 5.42 h    | 72.6 GB | 142.0 B      | 5.9 MB
driver      | 172.31.23.203:48018                               | 0          | 0.0 B / 511.1 MB  | 0.0 B     | -            | -            | -              | -           | 0 ms      | 0.0 B   | 0.0 B        | 0.0 B
(stdout / stderr / Thread Dump links omitted)
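
In the Executors tab, "Storage Memory" is the memory currently used for cached blocks out of the total memory that executor has available for storage. For GC stats, one option is to pass the standard JVM GC-logging flags through Spark's extraJavaOptions (a sketch; the flags are plain HotSpot options, and the executor output lands in each executor's stderr log):

spark-submit ... \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"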





Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Fridtjof Sander
Spark's implementation does perform PAVA on each partition only to then 
collect each result to the driver and to perform PAVA again on the 
collected results. The hope of that is, that enough data is pooled, so 
that the the last step does not exceed the drivers memory limits. This 
assumption does of course not generally hold. Just consider what 
happens, if the data is already correctly sorted. In that case nothing 
is pooled and model size roughly equals data size. Spark's IR model 
saves boundaries and predictions as double arrays instead, so the 
(unpooled) data has to fit into memory.


https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala

On 11.07.2016 at 17:06, Yanbo Liang wrote:
IsotonicRegression can handle feature column of vector type. It will 
extract the a certain index (controlled by param "featureIndex") of 
this feature vector and feed it into model training. It will perform 
Pool adjacent violators algorithms on each partition, so it's 
distributed and the data is not necessary to fit into memory of a 
single machine.

The following code snippets can work well on my machine:

val labels = Seq(1, 2, 3, 1, 6, 17, 16, 17, 18)
val dataset = spark.createDataFrame(
  labels.zipWithIndex.map { case (label, i) =>
    (label, Vectors.dense(Array(i.toDouble, i.toDouble + 1.0)), 1.0)
  }
).toDF("label", "features", "weight")

val ir = new IsotonicRegression().setIsotonic(true)
val model = ir.fit(dataset)

val predictions = model
  .transform(dataset)
  .select("prediction").rdd.map { case Row(pred) => pred }.collect()

assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18))


Thanks
Yanbo

2016-07-11 6:14 GMT-07:00 Fridtjof Sander:


Hi Swaroop,

from my understanding, Isotonic Regression is currently limited to
data with 1 feature plus weight and label. Also the entire data is
required to fit into memory of a single machine.
I did some work on the latter issue but discontinued the project,
because I felt no one really needed it. I'd be happy to resume my
work on Spark's IR implementation, but I fear there won't be a
quick for your issue.

Fridtjof


Am 08.07.2016 um 22:38 schrieb dsp:

Hi I am trying to perform Isotonic Regression on a data set
with 9 features
and a label.
When I run the algorithm similar to the way mentioned on MLlib
page, I get
the error saying

/*error:* overloaded method value run with alternatives:
(input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
java.lang.Double,

java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel

   (input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
  cannot be applied to
(org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
scala.Double, scala.Double, scala.Double, scala.Double,
scala.Double,
scala.Double, scala.Double, scala.Double, scala.Double,
scala.Double,
scala.Double)])
  val model = new
IsotonicRegression().setIsotonic(true).run(training)/

For the may given in the sample code, it looks like it can be
done only for
dataset with a single feature because run() method can accept
only three
parameters leaving which already has a label and a default
value leaving
place for only one variable.
So, How can this be done for multiple variables ?

Regards,
Swaroop



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org







Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
IsotonicRegression can handle feature column of vector type. It will
extract a certain index (controlled by param "featureIndex") of this
feature vector and feed it into model training. It will perform Pool
adjacent violators algorithms on each partition, so it's distributed and
the data is not necessary to fit into memory of a single machine.
The following code snippets can work well on my machine:

val labels = Seq(1, 2, 3, 1, 6, 17, 16, 17, 18)
val dataset = spark.createDataFrame(
  labels.zipWithIndex.map { case (label, i) =>
(label, Vectors.dense(Array(i.toDouble, i.toDouble + 1.0)), 1.0)
  }
).toDF("label", "features", "weight")

val ir = new IsotonicRegression().setIsotonic(true)

val model = ir.fit(dataset)

val predictions = model
  .transform(dataset)
  .select("prediction").rdd.map { case Row(pred) =>
  pred
}.collect()

assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18))


Thanks
Yanbo

2016-07-11 6:14 GMT-07:00 Fridtjof Sander :

> Hi Swaroop,
>
> from my understanding, Isotonic Regression is currently limited to data
> with 1 feature plus weight and label. Also the entire data is required to
> fit into memory of a single machine.
> I did some work on the latter issue but discontinued the project, because
> I felt no one really needed it. I'd be happy to resume my work on Spark's
> IR implementation, but I fear there won't be a quick for your issue.
>
> Fridtjof
>
>
> Am 08.07.2016 um 22:38 schrieb dsp:
>
>> Hi I am trying to perform Isotonic Regression on a data set with 9
>> features
>> and a label.
>> When I run the algorithm similar to the way mentioned on MLlib page, I get
>> the error saying
>>
>> /*error:* overloaded method value run with alternatives:
>> (input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
>> java.lang.Double,
>>
>> java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
>> 
>>(input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
>> scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
>>   cannot be applied to (org.apache.spark.rdd.RDD[(scala.Double,
>> scala.Double,
>> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
>> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
>> scala.Double)])
>>   val model = new
>> IsotonicRegression().setIsotonic(true).run(training)/
>>
>> For the may given in the sample code, it looks like it can be done only
>> for
>> dataset with a single feature because run() method can accept only three
>> parameters leaving which already has a label and a default value leaving
>> place for only one variable.
>> So, How can this be done for multiple variables ?
>>
>> Regards,
>> Swaroop
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
The presentation will go deeper into the topic. Otherwise some thoughts of
mine. Feel free to comment or criticise :)


   1. I am a member of Spark Hive and Tez user groups plus one or two others
   2. Spark is by far the biggest in terms of community interaction
   3. Tez, typically one thread in a month
   4. Personally started building Tez for Hive from Tez source and gave up
   as it was not working. This was my own build as opposed to a distro
   5. if Hive says you should use Spark or Tez then using Spark is a
   perfectly valid choice
   6. If Tez & LLAP offers you a Spark (DAG + in-memory caching) under the
   bonnet why bother.
   7. Yes I have seen some test results (Hive on Spark vs Hive on Tez) etc.
   but they are a bit dated (not being unkind) and cannot be taken as is
   today. One of their concerns, if I recall, was excessive CPU and memory usage
   of Spark, but by the same token LLAP will add additional need for
   resources.
   8. Essentially I am more comfortable using less of the technology stack
   than more.  With Hive and Spark (in this context) we have two. With Hive,
   Tez and LLAP, we have three stacks to look after that add to skill cost as
   well.
   9. Yep. It is still good to keep it simple


My thoughts on this are that if you have a viable open source product like
Spark which is becoming a sort of Vogue in Big Data space and moving very
fast, why look for another one. Hive does what it says on the Tin and good
reliable Data Warehouse.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 15:22, Ashok Kumar  wrote:

> Hi Mich,
>
> Your recent presentation in London on this topic "Running Spark on Hive or
> Hive on Spark"
>
> Have you made any more interesting findings that you like to bring up?
>
> If Hive is offering both Spark and Tez in addition to MR, what stopping
> one not to use Spark? I still don't get why TEZ + LLAP is going to be a
> better choice from what you mentioned?
>
> thanking you
>
>
>
> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh 
> wrote:
>
>
> Couple of points if I may and kindly bear with my remarks.
>
> Whilst it will be very interesting to try TEZ with LLAP. As I read from
> LLAP
>
> "Sub-second queries require fast query execution and low setup cost. The
> challenge for Hive is to achieve this without giving up on the scale and
> flexibility that users depend on. This requires a new approach using a
> hybrid engine that leverages Tez and something new called  LLAP (Live Long
> and Process, #llap online).
>
> LLAP is an optional daemon process running on multiple nodes, that
> provides the following:
>
>- Caching and data reuse across queries with compressed columnar data
>in-memory (off-heap)
>- Multi-threaded execution including reads with predicate pushdown and
>hash joins
>- High throughput IO using Async IO Elevator with dedicated thread and
>core per disk
>- Granular column level security across applications
>- "
>
> OK so we have added an in-memory capability to TEZ by way of LLAP, In
> other words what Spark does already and BTW it does not require a daemon
> running on any host. Don't take me wrong. It is interesting but this sounds
> to me (without testing myself) adding caching capability to TEZ to bring it
> on par with SPARK.
>
> Remember:
>
> Spark -> DAG + in-memory caching
> TEZ = MR on DAG
> TEZ + LLAP => DAG + in-memory caching
>
> OK it is another way getting the same result. However, my concerns:
>
>
>- Spark has a wide user base. I judge this from Spark user group
>traffic
>- TEZ user group has no traffic I am afraid
>- LLAP I don't know
>
> Sounds like Hortonworks promote TEZ and Cloudera does not want to know
> anything about Hive. and they promote Impala but that sounds like a sinking
> ship these days.
>
> Having said that I will try TEZ + LLAP :) No pun intended
>
> Regards
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 31 May 2016 at 08:19, Jörn Franke  wrote:
>
> Thanks very interesting explanation. Looking forward to test it.
>
> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan 
> wrote:
> >
> >
> >> That being said all systems are evolving. 

Spark hangs at "Removed broadcast_*"

2016-07-11 Thread velvetbaldmime
Spark 2.0.0-preview

We've got an app that uses a fairly big broadcast variable. We run this on a
big EC2 instance, so deployment is in client-mode. Broadcasted variable is a
massive Map[String, Array[String]].

At the end of saveAsTextFile, the output in the folder seems to be complete
and correct (apart from .crc files still being there) BUT the spark-submit
process is stuck on, seemingly, removing the broadcast variable. The stuck
logs look like this: http://pastebin.com/wpTqvArY

My last run lasted for 12 hours after doing saveAsTextFile - just
sitting there. I did a jstack on driver process, most threads are parked:
http://pastebin.com/E29JKVT7

Full story: We used this code with Spark 1.5.0 and it worked, but then the
data changed and something stopped fitting into Kryo's serialisation buffer.
Increasing it didn't help, so I had to disable the KryoSerialiser. Tested it
again - it hanged. Switched to 2.0.0-preview - seems like the same issue.

I'm not quite sure what's even going on given that there's almost no CPU
activity and no output in the logs, yet the output is not finalised like it
used to before.

Would appreciate any help, thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich,
Your recent presentation in London on this topic "Running Spark on Hive or Hive 
on Spark"
Have you made any more interesting findings that you like to bring up?
If Hive is offering both Spark and Tez in addition to MR, what is stopping one 
from using Spark? I still don't get why TEZ + LLAP is going to be a better choice 
from what you mentioned?
thanking you 
 

On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
wrote:
 

 Couple of points if I may and kindly bear with my remarks. 
Whilst it will be very interesting to try TEZ with LLAP. As I read from LLAP
"Sub-second queries require fast query execution and low setup cost. The 
challenge for Hive is to achieve this without giving up on the scale and 
flexibility that users depend on. This requires a new approach using a hybrid 
engine that leverages Tez and something new called  LLAP (Live Long and 
Process, #llap online).
LLAP is an optional daemon process running on multiple nodes, that provides the 
following:   
   - Caching and data reuse across queries with compressed columnar data 
in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and hash 
joins
   - High throughput IO using Async IO Elevator with dedicated thread and core 
per disk
   - Granular column level security across applications
   - "
OK so we have added an in-memory capability to TEZ by way of LLAP, In other 
words what Spark does already and BTW it does not require a daemon running on 
any host. Don't take me wrong. It is interesting but this sounds to me (without 
testing myself) adding caching capability to TEZ to bring it on par with SPARK. 
Remember:
Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching
OK it is another way getting the same result. However, my concerns:
   
   - Spark has a wide user base. I judge this from Spark user group traffic
   - TEZ user group has no traffic I am afraid
   - LLAP I don't know
Sounds like Hortonworks promote TEZ and Cloudera does not want to know anything 
about Hive. and they promote Impala but that sounds like a sinking ship these 
days.
Having said that I will try TEZ + LLAP :) No pun intended
Regards
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 31 May 2016 at 08:19, Jörn Franke  wrote:

Thanks very interesting explanation. Looking forward to test it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan  wrote:
>
>
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
>
> There is a big difference between where LLAP & SparkSQL, which has to do
> with access pattern needs.
>
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session which allows for further operation in that
> session to be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve it for ORC, but the
> goals of both are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of send my ramblings to the user lists like this.
> Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single use clusters
> and take the use-case detailed here
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
>
> There are two distinct problems there - one is that a single day sees upto
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
>
> The problem with having independent 100k user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
>
> The same problem applies when you apply RDD caching with something like
> un-replicated like Tachyon/Alluxio, since the same RDD will be exceeding
> popular that the machines which hold those blocks run extra hot.
>
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of cluster, while leaving the other
> machines idle.
>
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Fridtjof Sander

Hi Swaroop,

From my understanding, Isotonic Regression is currently limited to data 
with 1 feature plus weight and label. Also, the entire data set is required 
to fit into the memory of a single machine.
I did some work on the latter issue but discontinued the project, 
because I felt no one really needed it. I'd be happy to resume my work 
on Spark's IR implementation, but I fear there won't be a quick fix for your 
issue.


Fridtjof
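
To illustrate: run() only accepts an RDD of (label, feature, weight) triples, so with
the current API you have to reduce the 9 features to a single one (or train one model
per feature). A rough sketch, assuming a hypothetical CSV layout of label followed by
nine features:

import org.apache.spark.mllib.regression.IsotonicRegression

// hypothetical file: each line is "label,f1,f2,...,f9"
val rows = sc.textFile("data/points.csv").map(_.split(',').map(_.toDouble))

// keep the label, ONE chosen feature (here f1) and a constant weight of 1.0
val training = rows.map(r => (r(0), r(1), 1.0))   // RDD[(Double, Double, Double)]

val model = new IsotonicRegression().setIsotonic(true).run(training)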

On 08.07.2016 at 22:38, dsp wrote:

Hi, I am trying to perform Isotonic Regression on a data set with 9 features
and a label.
When I run the algorithm similar to the way mentioned on MLlib page, I get
the error saying

/*error:* overloaded method value run with alternatives:
(input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
java.lang.Double,
java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel

   (input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
  cannot be applied to (org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
scala.Double)])
  val model = new
IsotonicRegression().setIsotonic(true).run(training)/

For the way given in the sample code, it looks like it can be done only for a
dataset with a single feature, because the run() method accepts only a tuple of
three values, which already holds the label and the weight, leaving room for
only one feature.
So, how can this be done for multiple variables?

Regards,
Swaroop



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark job state is EXITED but does not return

2016-07-11 Thread Balachandar R.A.
Hello,

I have one simple Apache Spark based use case that processes two datasets.
Each dataset takes about 5-7 min to process. I am doing this processing
inside the sc.parallelize(datasets){ } block. While the first dataset is
processed successfully, the processing of the second dataset is not started by Spark.
The application state is RUNNING, but in the executor summary I notice that the
executor state is EXITED. Can someone tell me where things are going wrong?

Regards
Bala


Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Chanh Le
Hi Mich,

I have a stored procedure in Oracle written like this.
SP to get info:
PKG_ETL.GET_OBJECTS_INFO(
p_LAST_UPDATED VARCHAR2,
p_OBJECT_TYPE VARCHAR2,
p_TABLE OUT SYS_REFCURSOR);
How can I call it from Spark, given that the output is a cursor (p_TABLE OUT SYS_REFCURSOR)?


Thanks.
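
For reference: as far as I know, Spark's JDBC data source only accepts a table name or
a sub-query as "dbtable", not a REF CURSOR, so one workaround is to open the cursor
with plain JDBC on the driver and build a DataFrame from the rows yourself. A rough
sketch (connection details and bind values are placeholders, error handling omitted):

import java.sql.{DriverManager, ResultSet}
import oracle.jdbc.OracleTypes

val conn = DriverManager.getConnection("jdbc:oracle:thin:@host:1521:mydb", "user", "pass")
val cs = conn.prepareCall("{ call PKG_ETL.GET_OBJECTS_INFO(?, ?, ?) }")
cs.setString(1, "2016-07-11")           // p_LAST_UPDATED (placeholder value)
cs.setString(2, "TABLE")                // p_OBJECT_TYPE (placeholder value)
cs.registerOutParameter(3, OracleTypes.CURSOR)
cs.execute()
val rs = cs.getObject(3).asInstanceOf[ResultSet]
// iterate over rs, collect the rows into a local Seq, and then build a DataFrame
// with sqlContext.createDataFrame(...) if the result is needed in Spark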


> On Jul 11, 2016, at 4:18 PM, Mark Vervuurt  wrote:
> 
> Thanks Mich,
> 
> we have got it working using the example here under ;)
> 
> Mark
> 
>> On 11 Jul 2016, at 09:45, Mich Talebzadeh > > wrote:
>> 
>> Hi Mark,
>> 
>> Hm. It should work. This is Spark 1.6.1 on Oracle 12c
>>  
>>  
>> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> HiveContext: org.apache.spark.sql.hive.HiveContext = 
>> org.apache.spark.sql.hive.HiveContext@70f446c
>>  
>> scala> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
>>  
>> scala> var _username : String = "sh"
>> _username: String = sh
>>  
>> scala> var _password : String = ""
>> _password: String = sh
>>  
>> scala> val c = HiveContext.load("jdbc",
>>  | Map("url" -> _ORACLEserver,
>>  | "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC 
>> FROM sh.channels)",
>>  | "user" -> _username,
>>  | "password" -> _password))
>> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
>> details
>> c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: string, CHANNEL_DESC: 
>> string]
>>  
>> scala> c.registerTempTable("t_c")
>>  
>> scala> c.count
>> res2: Long = 5
>>  
>> scala> HiveContext.sql("select * from t_c").collect.foreach(println)
>> [3,Direct Sales]
>> [9,Tele Sales]
>> [5,Catalog]
>> [4,Internet]
>> [2,Partners]
>>  
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 11 July 2016 at 08:25, Mark Vervuurt > > wrote:
>> Hi Mich,
>> 
>> sorry for bothering did you manage to solve your problem? We have a similar 
>> problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an 
>> Oracle Database.
>> 
>> Thanks,
>> Mark
>> 
>>> On 12 Feb 2016, at 11:45, Mich Talebzadeh >> > wrote:
>>> 
>>> Hi,
>>>  
>>> I use the following to connect to Oracle DB from Spark shell 1.5.2
>>>  
>>> spark-shell --master spark://50.140.197.217:7077 <> --driver-class-path 
>>> /home/hduser/jars/ojdbc6.jar
>>>  
>>> in Scala I do
>>>  
>>> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>> sqlContext: org.apache.spark.sql.SQLContext = 
>>> org.apache.spark.sql.SQLContext@f9d4387
>>>  
>>> scala> val channels = sqlContext.read.format("jdbc").options(
>>>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>>>  |  "dbtable" -> "(select * from sh.channels where channel_id = 
>>> 14)",
>>>  |  "user" -> "sh",
>>>  |   "password" -> "xxx")).load
>>> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
>>> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
>>> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>>>  
>>> scala> channels.count()
>>>  
>>> But the latter command keeps hanging?
>>>  
>>> Any ideas appreciated
>>>  
>>> Thanks,
>>>  
>>> Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> NOTE: The information in this email is proprietary and confidential. This 
>>> message is for the designated recipient only, if you are not the intended 
>>> recipient, you should destroy it immediately. Any information in this 
>>> message shall not be understood as given or endorsed by Peridale Technology 
>>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>>> the responsibility of the recipient to ensure that this email is virus 
>>> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>>> employees accept any responsibility.
>> 
>> Met vriendelijke groet | Best regards,
>> 

Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Thanks Mich,

we have got it working using the example below ;)

Mark

> On 11 Jul 2016, at 09:45, Mich Talebzadeh  wrote:
> 
> Hi Mark,
> 
> Hm. It should work. This is Spark 1.6.1 on Oracle 12c
>  
>  
> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> HiveContext: org.apache.spark.sql.hive.HiveContext = 
> org.apache.spark.sql.hive.HiveContext@70f446c
>  
> scala> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
>  
> scala> var _username : String = "sh"
> _username: String = sh
>  
> scala> var _password : String = ""
> _password: String = sh
>  
> scala> val c = HiveContext.load("jdbc",
>  | Map("url" -> _ORACLEserver,
>  | "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC 
> FROM sh.channels)",
>  | "user" -> _username,
>  | "password" -> _password))
> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
> details
> c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: string, CHANNEL_DESC: string]
>  
> scala> c.registerTempTable("t_c")
>  
> scala> c.count
> res2: Long = 5
>  
> scala> HiveContext.sql("select * from t_c").collect.foreach(println)
> [3,Direct Sales]
> [9,Tele Sales]
> [5,Catalog]
> [4,Internet]
> [2,Partners]
>  
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 11 July 2016 at 08:25, Mark Vervuurt  > wrote:
> Hi Mich,
> 
> sorry for bothering did you manage to solve your problem? We have a similar 
> problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an 
> Oracle Database.
> 
> Thanks,
> Mark
> 
>> On 12 Feb 2016, at 11:45, Mich Talebzadeh > > wrote:
>> 
>> Hi,
>>  
>> I use the following to connect to Oracle DB from Spark shell 1.5.2
>>  
>> spark-shell --master spark://50.140.197.217:7077 <> --driver-class-path 
>> /home/hduser/jars/ojdbc6.jar
>>  
>> in Scala I do
>>  
>> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> sqlContext: org.apache.spark.sql.SQLContext = 
>> org.apache.spark.sql.SQLContext@f9d4387
>>  
>> scala> val channels = sqlContext.read.format("jdbc").options(
>>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>>  |  "dbtable" -> "(select * from sh.channels where channel_id = 14)",
>>  |  "user" -> "sh",
>>  |   "password" -> "xxx")).load
>> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
>> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
>> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>>  
>> scala> channels.count()
>>  
>> But the latter command keeps hanging?
>>  
>> Any ideas appreciated
>>  
>> Thanks,
>>  
>> Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Peridale Technology 
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>> the responsibility of the recipient to ensure that this email is virus free, 
>> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>> employees accept any responsibility.
> 
> Met vriendelijke groet | Best regards,
> ___
> 
> Ir. Mark Vervuurt
> Senior Big Data Scientist | Insights & Data
> 
> Capgemini Nederland | Utrecht
> Tel.: +31 30 6890978  – Mob.: +31653670390 
> 
> www.capgemini.com 
> 
>  People matter, results count.
> __
> 
> 
> 
> 



question about UDAF

2016-07-11 Thread luohui20001
Hello guys: I have a DF and a UDAF. This DF has 2 columns, lp_location_id and 
id, both of Int type. I want to group by lp_location_id and aggregate all values of 
id into 1 string. So I used a UDAF to do this transformation: multiple Int values to 
1 String. However, my UDAF returns empty values, as the attached file shows. 
Here is the code of my main class:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hiveTable = hc.sql("select lp_location_id,id from house_id_pv_location_top50")

val jsonArray = new JsonArray
val result = hiveTable.groupBy("lp_location_id").agg(jsonArray(col("id")).as("jsonArray")).collect.foreach(println)

Here is the code of my UDAF:
class JsonArray extends UserDefinedAggregateFunction {
  def inputSchema: org.apache.spark.sql.types.StructType =
StructType(StructField("id", IntegerType) :: Nil)

  def bufferSchema: StructType = StructType(
StructField("id", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = ""
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = buffer.getAs[Int](0)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val s1 = buffer1.getAs[Int](0).toString()
val s2 = buffer2.getAs[Int](0).toString()
buffer1(0) = s1.concat(s2)
  }
  def evaluate(buffer: Row): Any = {
buffer(0)
  }
}

I don't quite understand why I get an empty result from my UDAF. I guess there may 
be 2 reasons:
1. wrong initialization with "" in the initialize method
2. the buffer didn't get written to successfully.
Can anyone share an idea about this? Thank you.
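
For what it's worth, a minimal sketch of how the buffer handling could look so that the
ids actually accumulate: the buffer holds a String, so update has to append the incoming
Int to it rather than overwrite it with an Int. The class name and the comma separator
below are only illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ConcatIds extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("id", IntegerType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("ids", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""

  // append the incoming Int id to the running String buffer
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getString(0) + "," + input.getInt(0)

  // both partial buffers hold Strings, so read them as Strings and concatenate
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getString(0) + buffer2.getString(0)

  def evaluate(buffer: Row): Any = buffer.getString(0).stripPrefix(",")
}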





 

Thanks & best regards!
San.Luo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Zeppelin Spark with Dynamic Allocation

2016-07-11 Thread Chanh Le
Hi Tamas,
I am using Spark 1.6.1.





> On Jul 11, 2016, at 3:24 PM, Tamas Szuromi  wrote:
> 
> Hello,
> 
> What spark version do you use? I have the same issue with Spark 1.6.1 and 
> there is a ticket somewhere.
> 
> cheers,
> 
> 
> 
> 
> Tamas Szuromi
> Data Analyst
> Skype: tromika
> E-mail: tamas.szur...@odigeo.com 
> 
> ODIGEO Hungary Kft.
> 1066 Budapest
> Weiner Leó u. 16.
> www.liligo.com  
> check out our newest video  
> 
> 
> 
> On 11 July 2016 at 10:09, Chanh Le  > wrote:
> Hi everybody,
> I am testing Zeppelin with dynamic allocation but it seems it's not working.
> 
> 
> 
> 
> 
> 
> 
> From the logs I received, I saw that the Spark Context was created successfully and the task 
> was running, but after that it was terminated.
> Any ideas on that?
> Thanks.
> 
> 
> 
>  INFO [2016-07-11 15:03:40,096] ({Thread-0} 
> RemoteInterpreterServer.java[run]:81) - Starting remote interpreter server on 
> port 24994
>  INFO [2016-07-11 15:03:40,471] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkInterpreter
>  INFO [2016-07-11 15:03:40,521] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.PySparkInterpreter
>  INFO [2016-07-11 15:03:40,526] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkRInterpreter
>  INFO [2016-07-11 15:03:40,528] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
>  INFO [2016-07-11 15:03:40,531] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.DepInterpreter
>  INFO [2016-07-11 15:03:40,563] ({pool-2-thread-5} 
> SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468224220562 
> started by scheduler org.apache.zeppelin.spark.SparkInterpreter998491254
>  WARN [2016-07-11 15:03:41,559] ({pool-2-thread-5} 
> NativeCodeLoader.java[]:62) - Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
>  INFO [2016-07-11 15:03:41,703] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing view acls to: root
>  INFO [2016-07-11 15:03:41,704] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing modify acls to: root
>  INFO [2016-07-11 15:03:41,708] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(root); users with modify permissions: Set(root)
>  INFO [2016-07-11 15:03:41,977] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Starting HTTP Server
>  INFO [2016-07-11 15:03:42,029] ({pool-2-thread-5} Server.java[doStart]:272) 
> - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 15:03:42,047] ({pool-2-thread-5} 
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0 
> :53313
>  INFO [2016-07-11 15:03:42,048] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Successfully started service 'HTTP class server' on port 53313.
>  INFO [2016-07-11 15:03:43,978] ({pool-2-thread-5} 
> SparkInterpreter.java[createSparkContext]:233) - -- Create new 
> SparkContext mesos://zk://master1:2181,master2:2181,master3:2181/mesos <> 
> ---
>  INFO [2016-07-11 15:03:44,003] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Running Spark version 1.6.1
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing view acls to: root
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing modify acls to: root
>  INFO [2016-07-11 15:03:44,037] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(root); users with modify permissions: Set(root)
>  INFO [2016-07-11 15:03:44,231] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Successfully started service 'sparkDriver' on port 33913.
>  INFO [2016-07-11 15:03:44,552] 
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4} 
> Slf4jLogger.scala[applyOrElse]:80) - Slf4jLogger started
>  INFO [2016-07-11 15:03:44,597] 
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4} 
> Slf4jLogger.scala[apply$mcV$sp]:74) - Starting remoting
>  INFO [2016-07-11 15:03:44,754] 
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4} 
> Slf4jLogger.scala[apply$mcV$sp]:74) - Remoting started; listening on 
> addresses :[akka.tcp://sparkDriverActorSystem@10.197.0.3:55213 <>]
>  INFO [2016-07-11 15:03:44,760] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Successfully started service 'sparkDriverActorSystem' on port 55213.
>  INFO 
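
For reference, dynamic allocation also needs the external shuffle service in addition
to the allocation flags. On Mesos that roughly means setting, in the Spark interpreter
properties or spark-defaults (the numbers below are just placeholders):

spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         10
spark.dynamicAllocation.executorIdleTimeout  60s

and, if I remember correctly, starting the shuffle service on every Mesos agent with
$SPARK_HOME/sbin/start-mesos-shuffle-service.sh.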

Re: Zeppelin Spark with Dynamic Allocation

2016-07-11 Thread Tamas Szuromi
Hello,

What spark version do you use? I have the same issue with Spark 1.6.1 and
there is a ticket somewhere.

cheers,




Tamas Szuromi

Data Analyst

*Skype: *tromika
*E-mail: *tamas.szur...@odigeo.com 

[image: ODIGEO Hungary]

ODIGEO Hungary Kft.
1066 Budapest
Weiner Leó u. 16.

www.liligo.com  
check out our newest video  



On 11 July 2016 at 10:09, Chanh Le  wrote:

> Hi everybody,
> I am testing Zeppelin with dynamic allocation but it seems it's not working.
>
>
>
>
>
>
> From the logs I received, I saw that the Spark Context was created successfully and the task
> was running, but after that it was terminated.
> Any ideas on that?
> Thanks.
>
>
>
>  INFO [2016-07-11 15:03:40,096] ({Thread-0}
> RemoteInterpreterServer.java[run]:81) - Starting remote interpreter server
> on port 24994
>  INFO [2016-07-11 15:03:40,471] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.SparkInterpreter
>  INFO [2016-07-11 15:03:40,521] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.PySparkInterpreter
>  INFO [2016-07-11 15:03:40,526] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.SparkRInterpreter
>  INFO [2016-07-11 15:03:40,528] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
>  INFO [2016-07-11 15:03:40,531] ({pool-1-thread-2}
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate
> interpreter org.apache.zeppelin.spark.DepInterpreter
>  INFO [2016-07-11 15:03:40,563] ({pool-2-thread-5}
> SchedulerFactory.java[jobStarted]:131) - Job
> remoteInterpretJob_1468224220562 started by scheduler
> org.apache.zeppelin.spark.SparkInterpreter998491254
>  WARN [2016-07-11 15:03:41,559] ({pool-2-thread-5}
> NativeCodeLoader.java[]:62) - Unable to load native-hadoop library
> for your platform... using builtin-java classes where applicable
>  INFO [2016-07-11 15:03:41,703] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Changing view acls to: root
>  INFO [2016-07-11 15:03:41,704] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Changing modify acls to: root
>  INFO [2016-07-11 15:03:41,708] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - SecurityManager: authentication disabled; ui
> acls disabled; users with view permissions: Set(root); users with modify
> permissions: Set(root)
>  INFO [2016-07-11 15:03:41,977] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Starting HTTP Server
>  INFO [2016-07-11 15:03:42,029] ({pool-2-thread-5}
> Server.java[doStart]:272) - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 15:03:42,047] ({pool-2-thread-5}
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0
> :53313
>  INFO [2016-07-11 15:03:42,048] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Successfully started service 'HTTP class
> server' on port 53313.
> * INFO [2016-07-11 15:03:43,978] ({pool-2-thread-5}
> SparkInterpreter.java[createSparkContext]:233) - -- Create new
> SparkContext mesos://zk://master1:2181,master2:2181,master3:2181/mesos
> ---*
>  INFO [2016-07-11 15:03:44,003] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Running Spark version 1.6.1
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Changing view acls to: root
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Changing modify acls to: root
>  INFO [2016-07-11 15:03:44,037] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - SecurityManager: authentication disabled; ui
> acls disabled; users with view permissions: Set(root); users with modify
> permissions: Set(root)
>  INFO [2016-07-11 15:03:44,231] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Successfully started service 'sparkDriver' on
> port 33913.
>  INFO [2016-07-11 15:03:44,552]
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4}
> Slf4jLogger.scala[applyOrElse]:80) - Slf4jLogger started
>  INFO [2016-07-11 15:03:44,597]
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4}
> Slf4jLogger.scala[apply$mcV$sp]:74) - Starting remoting
>  INFO [2016-07-11 15:03:44,754]
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4}
> Slf4jLogger.scala[apply$mcV$sp]:74) - Remoting started; listening on
> addresses :[akka.tcp://sparkDriverActorSystem@10.197.0.3:55213]
>  INFO [2016-07-11 15:03:44,760] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Successfully started service
> 'sparkDriverActorSystem' on port 55213.
>  INFO [2016-07-11 15:03:44,771] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Registering MapOutputTracker
>  INFO [2016-07-11 15:03:44,789] ({pool-2-thread-5}
> Logging.scala[logInfo]:58) - Registering BlockManagerMaster
>  INFO [2016-07-11 15:03:44,802] 

Re: StreamingKmeans Spark doesn't work at all

2016-07-11 Thread Biplob Biswas
Hi Shuai,

Thanks for the reply. I mentioned in the mail that I tried running the
Scala example as well, from the link I provided, and the result is the same.
Thanks & Regards
Biplob Biswas

On Mon, Jul 11, 2016 at 5:52 AM, Shuai Lin  wrote:

> I would suggest you run the scala version of the example first, so you can
> tell whether it's a problem of the data you provided or a problem of the
> java code.
>
> On Mon, Jul 11, 2016 at 2:37 AM, Biplob Biswas 
> wrote:
>
>> Hi,
>>
>> I know I am asking again, but I tried running the same thing on a Mac as
>> well, since some answers on the internet suggested it could be an issue with
>> the Windows environment, but still nothing works.
>>
>> Can anyone at least suggest whether it's a bug with Spark or whether it is
>> something else?
>>
>> Would be really grateful! Thanks a lot.
>>
>> Thanks & Regards
>> Biplob Biswas
>>
>> On Thu, Jul 7, 2016 at 5:21 PM, Biplob Biswas 
>> wrote:
>>
>>> Hi,
>>>
>>> Can anyone care to please look into this issue?  I would really love
>>> some assistance here.
>>>
>>> Thanks a lot.
>>>
>>> Thanks & Regards
>>> Biplob Biswas
>>>
>>> On Tue, Jul 5, 2016 at 1:00 PM, Biplob Biswas 
>>> wrote:
>>>

 Hi,

 I implemented the StreamingKMeans example provided on the Spark website,
 but in Java.
 The full implementation is here,

 http://pastebin.com/CJQfWNvk

 But I am not getting anything in the output except occasional timestamps
 like the one below:

 ---
 Time: 1466176935000 ms
 ---

 Also, i have 2 directories:
 "D:\spark\streaming example\Data Sets\training"
 "D:\spark\streaming example\Data Sets\test"

 and inside these directories i have 1 file each
 "samplegpsdata_train.txt"
 and "samplegpsdata_test.txt" with training data having 500 datapoints
 and
 test data with 60 datapoints.

 I am very new to Spark and any help is highly appreciated.


 //---//

 Now, I have also tried using the Scala implementation available
 here:

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala


 and even had the training and test file provided in the format
 specified in
 that file as follows:

  * The rows of the training text files must be vector data in the form
  * `[x1,x2,x3,...,xn]`
  * Where n is the number of dimensions.
  *
  * The rows of the test text files must be labeled data in the form
  * `(y,[x1,x2,x3,...,xn])`
  * Where y is some identifier. n must be the same for train and test.


 But I still get no output in my Eclipse window ... just the Time!

 Can anyone seriously help me with this?

 Thank you so much
 Biplob Biswas



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/StreamingKmeans-Spark-doesn-t-work-at-all-tp27286.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>
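
For reference, a minimal Scala sketch of the intended setup (paths, k and the dimension
are placeholders). One common cause of empty batches is that textFileStream only picks
up files that are moved into the monitored directories after the StreamingContext has
started, so files that already sit there when the job starts are silently ignored:

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeansSketch")
val ssc = new StreamingContext(conf, Seconds(10))

// training rows look like [x1,x2,...,xn]; test rows look like (y,[x1,x2,...,xn])
val trainingData = ssc.textFileStream("D:/spark/training").map(Vectors.parse)
val testData = ssc.textFileStream("D:/spark/test").map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(2)                    // number of clusters (placeholder)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)   // the dimension must match the number of features

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()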


Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mich Talebzadeh
Hi Mark,


Hm. It should work. This is Spark 1.6.1 on Oracle 12c





scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

HiveContext: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@70f446c



scala> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"

_ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12



scala> var _username : String = "sh"

_username: String = sh



scala> var _password : String = ""

_password: String = sh



scala> val c = HiveContext.load("jdbc",

 | Map("url" -> _ORACLEserver,

 | "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID,
CHANNEL_DESC FROM sh.channels)",

 | "user" -> _username,

 | "password" -> _password))

warning: there were 1 deprecation warning(s); re-run with -deprecation for
details

c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: string, CHANNEL_DESC:
string]



scala> c.registerTempTable("t_c")



scala> c.count

res2: Long = 5



scala> HiveContext.sql("select * from t_c").collect.foreach(println)

[3,Direct Sales]

[9,Tele Sales]

[5,Catalog]

[4,Internet]

[2,Partners]


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 08:25, Mark Vervuurt  wrote:

> Hi Mich,
>
> sorry for bothering did you manage to solve your problem? We have a
> similar problem with Spark 1.5.2 using a JDBC connection with a DataFrame
> to an Oracle Database.
>
> Thanks,
> Mark
>
> On 12 Feb 2016, at 11:45, Mich Talebzadeh  wrote:
>
> Hi,
>
> I use the following to connect to Oracle DB from Spark shell 1.5.2
>
> spark-shell --master spark://50.140.197.217:7077 --driver-class-path
> /home/hduser/jars/ojdbc6.jar
>
> in Scala I do
>
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext: org.apache.spark.sql.SQLContext =
> org.apache.spark.sql.SQLContext@f9d4387
>
> scala> val channels = sqlContext.read.format("jdbc").options(
>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>  |  "dbtable" -> "(select * from sh.channels where channel_id =
> 14)",
>  |  "user" -> "sh",
>  |   "password" -> "xxx")).load
> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127),
> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID:
> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>
> *scala> channels.count()*
>
> But the latter command keeps hanging?
>
> Any ideas appreciated
>
> Thanks,
>
> Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
> Met vriendelijke groet | Best regards,
> ___
>
>
> *Ir. Mark Vervuurt*
> Senior Big Data Scientist | Insights & Data
>
> Capgemini Nederland | Utrecht
> Tel.: +31 30 6890978 – Mob.: +31653670390
> www.capgemini.com
>
>  *People matter, results count.*
>
> __
>
>
>


Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Hi Mich,

Sorry for bothering you, but did you manage to solve your problem? We have a similar 
problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an Oracle 
database.

Thanks,
Mark

> On 12 Feb 2016, at 11:45, Mich Talebzadeh  > wrote:
> 
> Hi,
>  
> I use the following to connect to Oracle DB from Spark shell 1.5.2
>  
> spark-shell --master spark://50.140.197.217:7077 
>  --driver-class-path /home/hduser/jars/ojdbc6.jar
>  
> in Scala I do
>  
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext: org.apache.spark.sql.SQLContext = 
> org.apache.spark.sql.SQLContext@f9d4387
>  
> scala> val channels = sqlContext.read.format("jdbc").options(
>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>  |  "dbtable" -> "(select * from sh.channels where channel_id = 14)",
>  |  "user" -> "sh",
>  |   "password" -> "xxx")).load
> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>  
> scala> channels.count()
>  
> But the latter command keeps hanging?
>  
> Any ideas appreciated
>  
> Thanks,
>  
> Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.

Met vriendelijke groet | Best regards,
___

Ir. Mark Vervuurt
Senior Big Data Scientist | Insights & Data

Capgemini Nederland | Utrecht
Tel.: +31 30 6890978 – Mob.: +31653670390
www.capgemini.com 
 People matter, results count.
__





Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-11 Thread ayan guha
Hi

When you say "Zeppelin and STS", I am assuming you mean "Spark Interpreter"
and "JDBC interpreter" respectively.

Through Zeppelin, you can either run your own Spark application (by using
Zeppelin's own Spark context) via the Spark interpreter, OR you can access
STS, which is a separate Spark application (i.e. a separate Spark context), via the JDBC
interpreter. There should not be any need for these 2 contexts to coexist.

If you want to share data, save it to Hive from either context, and you
should be able to see the data from the other context, for example as in the
sketch below.
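
For example, a rough sketch (the database and table names are illustrative, and this
assumes Zeppelin's sqlContext is a Hive-enabled context):

// in a Zeppelin %spark paragraph: persist whatever you computed as a Hive table
val df = sqlContext.sql("SELECT id, count(*) AS cnt FROM some_source GROUP BY id")
df.write.mode("overwrite").saveAsTable("shared_db.my_result")

// in a %jdbc paragraph pointed at the Spark Thrift Server (or any JDBC client):
//   SELECT * FROM shared_db.my_result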

Best
Ayan



On Mon, Jul 11, 2016 at 3:00 PM, Chanh Le  wrote:

> Hi Ayan,
> I tested It works fine but one more confuse is If my (technical) users
> want to write some code in zeppelin to apply thing into Hive table?
> Zeppelin and STS can’t share Spark Context that mean we need separated
> process? Is there anyway to use the same Spark Context of STS?
>
> Regards,
> Chanh
>
>
> On Jul 11, 2016, at 10:05 AM, Takeshi Yamamuro 
> wrote:
>
> Hi,
>
> ISTM multiple sparkcontexts are not recommended in spark.
> See: https://issues.apache.org/jira/browse/SPARK-2243
>
> // maropu
>
>
> On Mon, Jul 11, 2016 at 12:01 PM, ayan guha  wrote:
>
>> Hi
>>
>> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on
>> YARN for few months now without much issue.
>>
>> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le  wrote:
>>
>>> Hi everybody,
>>> We are using Spark to query big data and currently we’re using Zeppelin
>>> to provide a UI for technical users.
>>> Now we also need to provide a UI for business users so we use Oracle BI
>>> tools and set up a Spark Thrift Server (STS) for it.
>>>
>>> When I run both Zeppelin and STS, I get this error:
>>>
>>> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4}
>>> SchedulerFactory.java[jobStarted]:131) - Job
>>> remoteInterpretJob_1468204821905 started by scheduler
>>> org.apache.zeppelin.spark.SparkInterpreter835015739
>>>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - Changing view acls to: giaosudau
>>>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - Changing modify acls to: giaosudau
>>>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - SecurityManager: authentication disabled; ui
>>> acls disabled; users with view permissions: Set(giaosudau); users with
>>> modify permissions: Set(giaosudau)
>>>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - Starting HTTP Server
>>>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4}
>>> Server.java[doStart]:272) - jetty-8.y.z-SNAPSHOT
>>>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4}
>>> AbstractConnector.java[doStart]:338) - Started
>>> SocketConnector@0.0.0.0:54818
>>>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4}
>>> Logging.scala[logInfo]:58) - Successfully started service 'HTTP class
>>> server' on port 54818.
>>>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4}
>>> SparkInterpreter.java[createSparkContext]:233) - -- Create new
>>> SparkContext local[*] ---
>>>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4}
>>> Logging.scala[logWarning]:70) - Another SparkContext is being constructed
>>> (or threw an exception in its constructor).  This may indicate an error,
>>> since only one SparkContext may be running in this JVM (see SPARK-2243).
>>> The other SparkContext was created at:
>>>
>>> Does that mean I need to set up / allow multiple contexts? Because this is only
>>> a test in local mode; if I deploy on a Mesos cluster, what would
>>> happen?
>>>
>>> I need you guys to suggest some solutions for that. Thanks.
>>>
>>> Chanh
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>


-- 
Best Regards,
Ayan Guha