Re: Using Zeppelin with Spark FP

2016-09-11 Thread andy petrella
Heya, probably worth giving the Spark Notebook a go then.
It can plot any Scala data (collection, RDD, DF, DS, custom, ...); all plots are
reactive, so they can deal with any sort of incoming data. You can ask on
the Gitter channel if you like.

hth
cheers

On Sun, Sep 11, 2016 at 11:12 PM Mich Talebzadeh 
wrote:

> Hi,
>
> Zeppelin is getting better.
>
> In its description it says:
>
> [image: image.png]
>
> So far so good. One feature that I have not managed to work out is
> creating plots with Spark functional programming. I can get SQL going by
> connecting to the Spark thrift server, and I can plot the results.
>
> [image: image.png]
>
> However, if I write it using functional programming, I won't be able to
> plot it. The plot feature is not available.
>
> Is this correct or I am missing something?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
-- 
andy


Access HDFS within Spark Map Operation

2016-09-11 Thread Saliya Ekanayake
Hi,

I've got a text file where each line is a record. For each record, I need
to process a file in HDFS.

So if I represent these records as an RDD and invoke a map() operation on
them how can I access the HDFS within that map()? Do I have to create a
Spark context within map() or is there a better solution to that?
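
To make the question concrete, here is a minimal sketch of the pattern I have in
mind, assuming each record is an HDFS path and using the Hadoop FileSystem API
inside mapPartitions instead of a SparkContext (the paths and the per-file
processing are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// records.txt: one HDFS path per line (hypothetical input layout)
val records = sc.textFile("hdfs:///data/records.txt")

val processed = records.mapPartitions { iter =>
  // Build the Hadoop config on the executor; do NOT create a SparkContext here.
  val fs = FileSystem.get(new Configuration())
  iter.map { pathStr =>
    val in = fs.open(new Path(pathStr))
    try {
      // Example processing: count the lines of the referenced file
      Source.fromInputStream(in).getLines().size
    } finally {
      in.close()
    }
  }
}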

Thank you,
Saliya



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington


Spark Save mode "Overwrite" -Lock wait timeout exceeded; try restarting transaction Error

2016-09-11 Thread Subhajit Purkayastha
I am using Spark 1.5.2 with MemSQL as a persistent repository.

 

I am trying to update rows (based on the primary key) if a key appears more
than once (basically run the save load as an upsert operation).

 

val UpSertConf = SaveToMemSQLConf(msc.memSQLConf,
  Some(SaveMode.Overwrite),
  //Some(SaveMode.Append),
  Map(
    "duplicate_key_behavior" -> "Replace"
    //, "insertBatchSize" -> "100"
  )
)

 

 

When I set the SaveMode to "Overwrite", I get the following errors because of
object locking:

 

 

[cloudera@quickstart scala-2.10]$ spark-submit GLBalance-assembly-1.0.jar
/home/cloudera/Downloads/cloud_code/output/FULL/2016-09-09_15-44-17

(number of records ,204919)

[Stage 7:>  (0 + 4) / 7]16/09/09 21:26:01 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 22)
java.sql.SQLException: Leaf Error (127.0.0.1:3308): Lock wait timeout exceeded; try restarting transaction
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:996)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:870)
at com.mysql.jdbc.MysqlIO.sendFileToServer(MysqlIO.java:3790)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2995)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2245)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2638)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2526)
at com.mysql.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1618)
at com.mysql.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1549)
at org.apache.commons.dbcp2.DelegatingStatement.executeUpdate(DelegatingStatement.java:234)
at org.apache.commons.dbcp2.DelegatingStatement.executeUpdate(DelegatingStatement.java:234)

 

 

If I change the SaveMode to "Append", the load completes but all the
duplicate records get rejected. I was told that there is a known bug:

 

https://issues.apache.org/jira/browse/SPARK-13699

 

Is there another way for me to save the data (in an upsert mode)?

 

Subhs



The 8th and the Largest Spark Summit is less than 8 weeks away!

2016-09-11 Thread Jules Damji
Fellow Sparkers!

With every Spark Summit, an Apache Spark community event, increasing numbers of users and developers attend. This is the eighth Summit, held in one of my favourite cosmopolitan cities in the European Union: Brussels.

We are offering a special promo code* for all Apache Spark users and developers on this list: SparkSummitEU8

Please register at https://spark-summit.org/eu-2016/ between now and September 23 with this promo code and get 20% off.

*Valid for new registrants only.

I hope to see you all there!

Cheers,
Jules Damji
Spark Community Evangelist
--
The Best Ideas Are Simple
Jules S. Damji
e-mail: dmatrix@comcast.net
e-mail: jules.damji@gmail.com
twitter: @2twitme

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX drawing algorithm

2016-09-11 Thread Michael Malak
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with 
d3.js to render graphs using d3's force-directed rendering algorithm. The 
source code can be downloaded for free from 
https://www.manning.com/books/spark-graphx-in-action
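
Not from the book, but as a minimal sketch of the general idea, assuming a graph
small enough to collect on the driver: emit the nodes/links JSON shape that d3's
force layout expects (the exact link fields depend on your d3 version; the field
names here are illustrative).

import org.apache.spark.graphx.Graph

// Assumes graph: Graph[String, Int] fits on the driver.
def toD3Json(graph: Graph[String, Int]): String = {
  // One JSON object per vertex: id plus a display label
  val nodes = graph.vertices.collect().map { case (id, label) =>
    s"""{"id":$id,"label":"$label"}"""
  }.mkString("[", ",", "]")

  // One JSON object per edge: source/target vertex ids plus the edge attribute
  val links = graph.edges.collect().map { e =>
    s"""{"source":${e.srcId},"target":${e.dstId},"value":${e.attr}}"""
  }.mkString("[", ",", "]")

  s"""{"nodes":$nodes,"links":$links}"""
}
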
  From: agc studio 
 To: user@spark.apache.org 
 Sent: Sunday, September 11, 2016 5:59 PM
 Subject: GraphX drawing algorithm
   
Hi all,
I was wondering if a force-directed graph drawing algorithm has been 
implemented for graphX?

Thanks

   

Re: Using Zeppelin with Spark FP

2016-09-11 Thread Jeff Zhang
You can plot a DataFrame, but it is not supported for RDDs AFAIK.
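
A minimal sketch, assuming Zeppelin's ZeppelinContext is bound as z in the
%spark interpreter: convert the RDD to a DataFrame and hand it to z.show, which
drives the built-in table/plot widgets. The case class and sample data are
illustrative.

// In a %spark paragraph; sqlContext provides the toDF implicits.
import sqlContext.implicits._

case class Price(security: String, price: Double)

val rdd = sc.parallelize(Seq(Price("IBM", 99.99), Price("BP", 99.99)))
val df  = rdd.toDF()

// Zeppelin renders DataFrames passed to z.show with its plotting toolbar.
z.show(df)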

On Mon, Sep 12, 2016 at 5:12 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> Zeppelin is getting better.
>
> In its description it says:
>
> [image: Inline images 1]
>
> So far so good. One feature that I have not managed to work out is
> creating plots with Spark functional programming. I can get SQL going by
> connecting to the Spark thrift server, and I can plot the results.
>
> [image: Inline images 2]
>
> However, if I write it using functional programming, I won't be able to
> plot it. The plot feature is not available.
>
> Is this correct or I am missing something?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>



-- 
Best Regards

Jeff Zhang


GraphX drawing algorithm

2016-09-11 Thread agc studio
Hi all,

I was wondering if a force-directed graph drawing algorithm has been
implemented for graphX?

Thanks


Using Zeppelin with Spark FP

2016-09-11 Thread Mich Talebzadeh
Hi,

Zeppelin is getting better.

In its description it says:

[image: Inline images 1]

So far so good. One feature that I have not managed to work out is creating
plots with Spark functional programming. I can get SQL going by connecting
to the Spark thrift server, and I can plot the results.

[image: Inline images 2]

However, if I write it using functional programming, I won't be able to
plot it. The plot feature is not available.

Is this correct or I am missing something?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Spark with S3 DirectOutputCommitter

2016-09-11 Thread Steve Loughran

> On 9 Sep 2016, at 21:54, Srikanth  wrote:
> 
> Hello,
> 
> I'm trying to use DirectOutputCommitter for s3a in Spark 2.0. I've tried a 
> few configs and none of them seem to work.
> Output always creates _temporary directory. Rename is killing performance.

> I read some notes about DirectOutputcommitter causing problems with 
> speculation turned on. Was this option removed entirely? 

Spark turns off any committer with the word "direct" in its name if 
speculation==true. Concurrency, see.

Even on non-speculative execution, the trouble with the direct options is that 
executor/job failures can leave incomplete/inconsistent work around, and the 
things downstream wouldn't even notice.

There's work underway to address things, work which requires a consistent 
metadata store alongside S3 ( HADOOP-13345 : S3Guard).

For now: stay with the file output committer

hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true

Even better: use HDFS as the intermediate store for work, only do a bulk upload 
at the end.
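
A sketch of one way to wire that up from the Spark side, assuming the
spark.hadoop.* passthrough and the key names above (double-check them against
your Hadoop version):

import org.apache.spark.sql.SparkSession

// Sketch only: pass the file-output-committer settings through Spark's
// spark.hadoop.* prefix so they land in the Hadoop Configuration.
val spark = SparkSession.builder()
  .appName("S3AOutput")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
  .getOrCreate()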

> 
>   val spark = SparkSession.builder()
> .appName("MergeEntities")
> .config("spark.sql.warehouse.dir", 
> mergeConfig.getString("sparkSqlWarehouseDir"))
> .config("fs.s3a.buffer.dir", "/tmp")
> .config("spark.hadoop.mapred.output.committer.class", 
> classOf[DirectOutputCommitter].getCanonicalName)
> .config("mapred.output.committer.class", 
> classOf[DirectOutputCommitter].getCanonicalName)
> .config("mapreduce.use.directfileoutputcommitter", "true")
> //.config("spark.sql.sources.outputCommitterClass", 
> classOf[DirectOutputCommitter].getCanonicalName)
> .getOrCreate()
> 
> Srikanth



Re: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-11 Thread Steve Loughran

On 9 Sep 2016, at 17:56, Daniel Lopes 
> wrote:

Hi, can someone help?

I'm trying to use Parquet in IBM Block Storage with Spark, but when I try to load
I get this error:

using this config

credentials = {
  "name": "keystone",
  "auth_url": 
"https://identity.open.softlayer.com",
  "project": "object_storage_23f274c1_d11XXXe634",
  "projectId": "XXd9c4aa39b7c7eb",
  "region": "dallas",
  "userId": "X64087180b40X2b909",
  "username": "admin_9dd810f8901d48778XX",
  "password": "chX6_",
  "domainId": "c1ddad17cfcX41",
  "domainName": "10XX",
  "role": "admin"
}

def set_hadoop_config(credentials):
"""This function sets the Hadoop configuration with given credentials,
so it is possible to access data using SparkContext"""

prefix = "fs.swift.service." + credentials['name']
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".auth.url", credentials['auth_url']+'/v3/auth/tokens')
hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
hconf.set(prefix + ".tenant", credentials['projectId'])
hconf.set(prefix + ".username", credentials['userId'])
hconf.set(prefix + ".password", credentials['password'])
hconf.setInt(prefix + ".http.port", 8080)
hconf.set(prefix + ".region", credentials['region'])
hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)

-

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 train.groupby('Acordo').count().show()

Py4JJavaError: An error occurred while calling o406.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 60 in 
stage 30.0 failed 10 times, most recent failure: Lost task 60.9 in stage 30.0 
(TID 2556, yp-spark-dal09-env5-0039): 
org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing 
mandatory configuration option: fs.swift.service.keystone.auth.url


In my own code, I'd assume that the value of credentials['name'] didn't match 
that of the URL, assuming you have something like swift://bucket.keystone . 
Failing that: the options were set too late.

Instead of asking for the hadoop config and editing that, set the option in 
your spark context, before it is launched, with the prefix "hadoop"
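
The snippet above is PySpark, but the idea is language-neutral; here is a hedged
Scala sketch of setting the same Swift options on the config before the context
is created, via the spark.hadoop.* passthrough. The service name "keystone" and
the placeholder values are assumptions.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: set the Swift options *before* the context is created, prefixed with
// spark.hadoop. so they are copied into the Hadoop configuration at startup.
val conf = new SparkConf()
  .setAppName("SwiftParquet")
  .set("spark.hadoop.fs.swift.service.keystone.auth.url",
       "https://identity.open.softlayer.com/v3/auth/tokens")
  .set("spark.hadoop.fs.swift.service.keystone.auth.endpoint.prefix", "endpoints")
  .set("spark.hadoop.fs.swift.service.keystone.tenant", "<projectId>")
  .set("spark.hadoop.fs.swift.service.keystone.username", "<userId>")
  .set("spark.hadoop.fs.swift.service.keystone.password", "<password>")
  .set("spark.hadoop.fs.swift.service.keystone.http.port", "8080")
  .set("spark.hadoop.fs.swift.service.keystone.region", "dallas")
  .set("spark.hadoop.fs.swift.service.keystone.public", "true")

val sc = new SparkContext(conf)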


at 
org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:223)
at 
org.apache.hadoop.fs.swift.http.RestClientBindings.bind(RestClientBindings.java:147)


Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br



Re: Get spark metrics in code

2016-09-11 Thread Steve Loughran

> On 9 Sep 2016, at 13:20, Han JU  wrote:
> 
> Hi,
> 
> I'd like to know if there's a possibility to get spark's metrics from code. 
> For example
> 
>   val sc = new SparkContext(conf)
>   val result = myJob(sc, ...)
>   result.save(...)
>   
>   val gauge = MetricSystem.getGauge("org.apache.spark")
>   println(gauge.getValue)  // or send to to internal aggregation service
> 
> I'm aware that there's a configuration for sending metrics to several kinds 
> of sinks but I'm more interested in a per job basis style and we use a custom 
> log/metric aggregation service for building dashboards.
> 

It's all Coda Hale metrics; it should be retrievable somehow, for a loose 
definition of "somehow".

I'd be interested in knowing what you come up with here. 
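
One per-job direction that avoids the MetricsSystem entirely: a hedged sketch
using the public SparkListener API (Spark 2.0-style TaskMetrics accessors) to
aggregate task metrics for a job and hand the totals to your own aggregation
service. The class and variable names are made up.

import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Aggregates a couple of task-level metrics instead of going through a sink.
class JobMetricsListener extends SparkListener {
  val recordsRead  = new LongAdder
  val executorTime = new LongAdder

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      recordsRead.add(m.inputMetrics.recordsRead)
      executorTime.add(m.executorRunTime)
    }
  }
}

// Usage sketch: register before running the job, read after it finishes.
// val listener = new JobMetricsListener
// sc.addSparkListener(listener)
// myJob(sc, ...)
// println(s"records=${listener.recordsRead.sum} runtimeMs=${listener.executorTime.sum}")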


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



"Too many elements to create a power set" on Elasticsearch

2016-09-11 Thread Kevin Burton
1.6.1 and 1.6.2 don't work on our Elasticsearch setup because we use daily
indexes.

We get the error:

"Too many elements to create a power set"

It works on SINGLE indexes, but if I specify content_* then I get this
error.

I don't see this documented anywhere.  Is this a known issue?

Is there a potential work around?

The code works properly when I specify an explicit daily index.

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: Spark metrics when running with YARN?

2016-09-11 Thread Jacek Laskowski
Hi Vladimir,

You'd have to talk to your cluster manager to query for all the
running Spark applications. I'm pretty sure YARN and Mesos can do that
but unsure about Spark Standalone. This is certainly not something a
Spark application's web UI could do for you since it is designed to
handle a single Spark application.
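
For YARN specifically, a hedged sketch: the ResourceManager exposes a REST API
that lists applications, so you don't have to probe the 40xx ports. The host and
port below are assumptions; adjust them to your cluster.

import scala.io.Source

// List the RUNNING applications known to the YARN ResourceManager.
val rmUrl = "http://resourcemanager-host:8088/ws/v1/cluster/apps?states=RUNNING"
val json  = Source.fromURL(rmUrl).mkString
println(json) // parse with your JSON library of choice to get app ids / tracking URLs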

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Sep 11, 2016 at 11:18 AM, Vladimir Tretyakov
 wrote:
> Hello Jacek, thanks a lot, it works.
>
> Is there a way to get the list of running applications from the REST API? Or
> do I have to try connecting to ports 4040, 4041, ... 40xx and check whether
> each port answers?
>
> Best regards, Vladimir.
>
> On Sat, Sep 10, 2016 at 6:00 AM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> That's correct. One app one web UI. Open 4041 and you'll see the other
>> app.
>>
>> Jacek
>>
>>
>> On 9 Sep 2016 11:53 a.m., "Vladimir Tretyakov"
>>  wrote:
>>>
>>> Hello again.
>>>
>>> I am trying to play with Spark version "2.11-2.0.0".
>>>
>>> The problem is that the REST API and the UI show me different things.
>>>
>>> I've started 2 applications from the "examples set": opened 2 consoles and ran the
>>> following command in each:
>>>
>>> ./bin/spark-submit   --class org.apache.spark.examples.SparkPi   --master
>>> spark://wawanawna:7077  --executor-memory 2G  --total-executor-cores 30
>>> examples/jars/spark-examples_2.11-2.0.0.jar  1
>>>
>>> Request to API endpoint:
>>>
>>> http://localhost:4040/api/v1/applications
>>>
>>> returned me following JSON:
>>>
>>> [ {
>>>   "id" : "app-20160909184529-0016",
>>>   "name" : "Spark Pi",
>>>   "attempts" : [ {
>>> "startTime" : "2016-09-09T15:45:25.047GMT",
>>> "endTime" : "1969-12-31T23:59:59.999GMT",
>>> "lastUpdated" : "2016-09-09T15:45:25.047GMT",
>>> "duration" : 0,
>>> "sparkUser" : "",
>>> "completed" : false,
>>> "startTimeEpoch" : 1473435925047,
>>> "endTimeEpoch" : -1,
>>> "lastUpdatedEpoch" : 1473435925047
>>>   } ]
>>> } ]
>>>
>>> so the response contains information about only one application.
>>>
>>> But in reality I've started two applications and the Spark UI shows me two
>>> RUNNING applications (please see the screenshot).
>>>
>>> Does anybody know why the API and the UI show different things?
>>>
>>>
>>> Best regards, Vladimir.
>>>
>>>
>>> On Tue, Aug 30, 2016 at 3:52 PM, Vijay Kiran  wrote:

 Hi Otis,

 Did you check the REST API as documented in
 http://spark.apache.org/docs/latest/monitoring.html

 Regards,
 Vijay

 > On 30 Aug 2016, at 14:43, Otis Gospodnetić
 >  wrote:
 >
 > Hi Mich and Vijay,
 >
 > Thanks!  I forgot to include an important bit - I'm looking for a
 > programmatic way to get Spark metrics when running Spark under YARN - so 
 > JMX
 > or API of some kind.
 >
 > Thanks,
 > Otis
 > --
 > Monitoring - Log Management - Alerting - Anomaly Detection
 > Solr & Elasticsearch Consulting Support Training -
 > http://sematext.com/
 >
 >
 > On Tue, Aug 30, 2016 at 6:59 AM, Mich Talebzadeh
 >  wrote:
 > Spark UI regardless of deployment mode Standalone, yarn etc runs on
 > port 4040 by default that can be accessed directly
 >
 > Otherwise one can specify a specific port with --conf
 > "spark.ui.port=5" for example 5
 >
 > HTH
 >
 > Dr Mich Talebzadeh
 >
 > LinkedIn
 > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 >
 > http://talebzadehmich.wordpress.com
 >
 > Disclaimer: Use it at your own risk. Any and all responsibility for
 > any loss, damage or destruction of data or any other property which may
 > arise from relying on this email's technical content is explicitly
 > disclaimed. The author will in no case be liable for any monetary damages
 > arising from such loss, damage or destruction.
 >
 >
 > On 30 August 2016 at 11:48, Vijay Kiran  wrote:
 >
 > From Yarm RM UI, find the spark application Id, and in the application
 > details, you can click on the “Tracking URL” which should give you the 
 > Spark
 > UI.
 >
 > ./Vijay
 >
 > > On 30 Aug 2016, at 07:53, Otis Gospodnetić
 > >  wrote:
 > >
 > > Hi,
 > >
 > > When Spark is run on top of YARN, where/how can one get Spark
 > > metrics?
 > >
 > > Thanks,
 > > Otis
 > > --
 > > Monitoring - Log Management - Alerting - Anomaly Detection
 > > Solr & Elasticsearch Consulting Support Training -
 > > http://sematext.com/
 > >
 >
 >
 > 

Re: Spark metrics when running with YARN?

2016-09-11 Thread Vladimir Tretyakov
Hello Jacek, thanks a lot, it works.

Is there a way to get the list of running applications from the REST API? Or
do I have to try connecting to ports 4040, 4041, ... 40xx and check whether
each port answers?

Best regards, Vladimir.

On Sat, Sep 10, 2016 at 6:00 AM, Jacek Laskowski  wrote:

> Hi,
>
> That's correct. One app one web UI. Open 4041 and you'll see the other
> app.
>
> Jacek
>
> On 9 Sep 2016 11:53 a.m., "Vladimir Tretyakov" <
> vladimir.tretya...@sematext.com> wrote:
>
>> Hello again.
>>
>> I am trying to play with Spark version "2.11-2.0.0".
>>
>> The problem is that the REST API and the UI show me different things.
>>
>> I've started 2 applications from the "examples set": opened 2 consoles and ran the
>> following command in each:
>>
>> *./bin/spark-submit   --class org.apache.spark.examples.SparkPi
>> --master spark://wawanawna:7077  --executor-memory 2G
>>  --total-executor-cores 30  examples/jars/spark-examples_2.11-2.0.0.jar
>>  1*
>>
>> Request to API endpoint:
>>
>> http://localhost:4040/api/v1/applications
>>
>> returned me following JSON:
>>
>> [ {
>>   "id" : "app-20160909184529-0016",
>>   "name" : "Spark Pi",
>>   "attempts" : [ {
>> "startTime" : "2016-09-09T15:45:25.047GMT",
>> "endTime" : "1969-12-31T23:59:59.999GMT",
>> "lastUpdated" : "2016-09-09T15:45:25.047GMT",
>> "duration" : 0,
>> "sparkUser" : "",
>> "completed" : false,
>> "startTimeEpoch" : 1473435925047,
>> "endTimeEpoch" : -1,
>> "lastUpdatedEpoch" : 1473435925047
>>   } ]
>> } ]
>>
>> so the response contains information about only one application. But in
>> reality I've started two applications and the Spark UI shows me two RUNNING
>> applications (please see the screenshot). Does anybody know why the API and
>> the UI show different things?
>>
>>
>> Best regards, Vladimir.
>>
>>
>> On Tue, Aug 30, 2016 at 3:52 PM, Vijay Kiran  wrote:
>>
>>> Hi Otis,
>>>
>>> Did you check the REST API as documented in
>>> http://spark.apache.org/docs/latest/monitoring.html
>>>
>>> Regards,
>>> Vijay
>>>
>>> > On 30 Aug 2016, at 14:43, Otis Gospodnetić 
>>> wrote:
>>> >
>>> > Hi Mich and Vijay,
>>> >
>>> > Thanks!  I forgot to include an important bit - I'm looking for a
>>> programmatic way to get Spark metrics when running Spark under YARN - so
>>> JMX or API of some kind.
>>> >
>>> > Thanks,
>>> > Otis
>>> > --
>>> > Monitoring - Log Management - Alerting - Anomaly Detection
>>> > Solr & Elasticsearch Consulting Support Training -
>>> http://sematext.com/
>>> >
>>> >
>>> > On Tue, Aug 30, 2016 at 6:59 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>> > Spark UI regardless of deployment mode Standalone, yarn etc runs on
>>> port 4040 by default that can be accessed directly
>>> >
>>> > Otherwise one can specify a specific port with --conf
>>> "spark.ui.port=5" for example 5
>>> >
>>> > HTH
>>> >
>>> > Dr Mich Talebzadeh
>>> >
>>> > LinkedIn  https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJ
>>> d6zP6AcPCCdOABUrV8Pw
>>> >
>>> > http://talebzadehmich.wordpress.com
>>> >
>>> > Disclaimer: Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>> >
>>> >
>>> > On 30 August 2016 at 11:48, Vijay Kiran  wrote:
>>> >
>>> > From Yarm RM UI, find the spark application Id, and in the application
>>> details, you can click on the “Tracking URL” which should give you the
>>> Spark UI.
>>> >
>>> > ./Vijay
>>> >
>>> > > On 30 Aug 2016, at 07:53, Otis Gospodnetić <
>>> otis.gospodne...@gmail.com> wrote:
>>> > >
>>> > > Hi,
>>> > >
>>> > > When Spark is run on top of YARN, where/how can one get Spark
>>> metrics?
>>> > >
>>> > > Thanks,
>>> > > Otis
>>> > > --
>>> > > Monitoring - Log Management - Alerting - Anomaly Detection
>>> > > Solr & Elasticsearch Consulting Support Training -
>>> http://sematext.com/
>>> > >
>>> >
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >
>>> >
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>


Re: Selecting the top 100 records per group by?

2016-09-11 Thread Mich Talebzadeh
You can of course do this using FP.

val wSpec = Window.partitionBy('price).orderBy(desc("price"))
df2.filter('security > " ")
   .select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE, 1, 7))
   .filter('rank <= 10)
   .show
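
And for the original per-group question, a minimal sketch, assuming a
hypothetical posts DataFrame with user_id (group key) and score (ordering)
columns: partition the window by the key and keep row_number() <= 100.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Hypothetical columns: user_id is the group key, score the ordering criterion.
val byUser = Window.partitionBy(col("user_id")).orderBy(desc("score"))

val top100PerUser = posts
  .withColumn("rn", row_number().over(byUser))  // 1, 2, 3, ... within each user_id
  .filter(col("rn") <= 100)
  .drop("rn")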


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 September 2016 at 07:15, Mich Talebzadeh 
wrote:

> DENSE_RANK will give you ordering and sequence within a particular column.
> This is Hive
>
>  var sqltext = """
>  | SELECT RANK, timecreated,security, price
>  |  FROM (
>  |SELECT timecreated,security, price,
>  |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
>  |  FROM test.prices
>  |   ) tmp
>  |  WHERE rank <= 10
>  | """
> sql(sqltext).collect.foreach(println)
>
> [1,2016-09-09 16:55:44,Esso,99.995]
> [1,2016-09-09 21:22:52,AVIVA,99.995]
> [1,2016-09-09 21:22:52,Barclays,99.995]
> [1,2016-09-09 21:24:28,JPM,99.995]
> [1,2016-09-09 21:30:38,Microsoft,99.995]
> [1,2016-09-09 21:31:12,UNILEVER,99.995]
> [2,2016-09-09 16:54:14,BP,99.99]
> [2,2016-09-09 16:54:36,Tate & Lyle,99.99]
> [2,2016-09-09 16:56:28,EASYJET,99.99]
> [2,2016-09-09 16:59:28,IBM,99.99]
> [2,2016-09-09 20:16:10,EXPERIAN,99.99]
> [2,2016-09-09 22:25:20,Microsoft,99.99]
> [2,2016-09-09 22:53:49,Tate & Lyle,99.99]
> [3,2016-09-09 15:31:06,UNILEVER,99.985]
>
> HTH
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 September 2016 at 04:32, Kevin Burton  wrote:
>
>> Looks like you can do it with dense_rank functions.
>>
>> https://databricks.com/blog/2015/07/15/introducing-window-fu
>> nctions-in-spark-sql.html
>>
>> I setup some basic records and seems like it did the right thing.
>>
>> Now time to throw 50TB and 100 spark nodes at this problem and see what
>> happens :)
>>
>> On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton  wrote:
>>
>>> Ah.. might actually. I'll have to mess around with that.
>>>
>>> On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley  wrote:
>>>
 Would `topByKey` help?

 https://github.com/apache/spark/blob/master/mllib/src/main/s
 cala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42

 Best,
 Karl

 On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton 
 wrote:

> I'm trying to figure out a way to group by and return the top 100
> records in that group.
>
> Something like:
>
> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>
> But I can't really figure out the best way to do this...
>
> There is a FIRST and LAST aggregate function but this only returns one
> column.
>
> I could do something like:
>
> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ )
> LIMIT 100;
>
> But that limit is applied for ALL the records. Not each individual
> user.
>
> The only other thing I can think of is to do a manual map reduce and
> then have the reducer only return the top 100 each time...
>
> Would LOVE some advice here...
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux
> Operations Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>
>>>
>>>
>>> --
>>>
>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>> Engineers!
>>>
>>> Founder/CEO Spinn3r.com
>>> Location: *San Francisco, CA*
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> 
>>>
>>>
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>

Re: SparkR API problem with subsetting distributed data frame

2016-09-11 Thread Bene
I am calling dirs(x, dat) with a number for x and a distributed dataframe for
dat, like dirs(3, df).
With your logical expression Felix I would get another data frame, right?
This is not what I need; I need to extract a single value from a specific cell
for my calculations. Is that somehow possible?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-API-problem-with-subsetting-distributed-data-frame-tp27688p27692.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: classpath conflict with spark internal libraries and the spark shell.

2016-09-11 Thread Mendelson, Assaf
You can try shading the jar. Look at the Maven Shade plugin.
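
A minimal sketch of the relevant relocation, assuming the clash is on the
org.apache.http packages; adjust pattern and shadedPattern to your case.

<!-- Hedged sketch: relocate httpclient inside your fat jar so it no longer
     collides with the httpclient bundled in the Spark assembly. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>shaded.org.apache.http</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>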


From: Benyi Wang [mailto:bewang.t...@gmail.com]
Sent: Saturday, September 10, 2016 1:35 AM
To: Colin Kincaid Williams
Cc: user@spark.apache.org
Subject: Re: classpath conflict with spark internal libraries and the spark 
shell.

I had a problem when I used "spark.executor.userClassPathFirst" before. I don't 
remember what the problem was.

Alternatively, you can use --driver-class-path and "--conf 
spark.executor.extraClassPath".  Unfortunately you may feel frustrated like me 
when trying to make it work.

It depends on how you run Spark:
- standalone or YARN,
- run as an application or in spark-shell
The configuration will be different. It is hard to explain briefly, so I wrote 
two blog posts about it.
http://ben-tech.blogspot.com/2016/05/how-to-resolve-spark-cassandra.html
http://ben-tech.blogspot.com/2016/04/how-to-resolve-spark-cassandra.html

Hope those blogs help.

If you still have a class conflict problem, you can consider loading the external 
library and its dependencies using a special classloader, just like spark-hive, 
which can load a specified version of the Hive jars.

On Fri, Sep 9, 2016 at 2:53 PM, Colin Kincaid Williams 
> wrote:
My bad, gothos on IRC pointed me to the docs:

http://jhz.name/2016/01/10/spark-classpath.html

Thanks Gothos!

On Fri, Sep 9, 2016 at 9:23 PM, Colin Kincaid Williams 
> wrote:
> I'm using the spark shell v1.6.1. I have a classpath conflict, where I
> have an external library ( not OSS either :( , can't rebuild it.)
> using httpclient-4.5.2.jar. I use spark-shell --jars
> file:/path/to/httpclient-4.5.2.jar
>
> However spark is using httpclient-4.3 internally. Then when I try to
> use the external library I get
>
> getClass.getResource("/org/apache/http/conn/ssl/SSLConnectionSocketFactory.class");
>
> res5: java.net.URL =
> jar:file:/opt/spark-1.6.1-bin-hadoop2.4/lib/spark-assembly-1.6.1-hadoop2.4.0.jar!/org/apache/http/conn/ssl/SSLConnectionSocketFactory.class
>
> How do I get spark-shell on 1.6.1 to allow me to use the external
> httpclient-4.5.2.jar for my application, and ignore its internal
> library? Or is this not possible?

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



RE: Selecting the top 100 records per group by?

2016-09-11 Thread Mendelson, Assaf

You can also create a custom aggregation function. It might provide better 
performance than dense_rank.

Consider the following example to collect everything as list:
class CollectListFunction[T](val colType: DataType) extends 
UserDefinedAggregateFunction {

  def inputSchema: StructType =
new StructType().add("inputCol", colType)

  def bufferSchema: StructType =
new StructType().add("outputCol", ArrayType(colType))

  def dataType: DataType = ArrayType(colType)

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer.update(0, new mutable.ArrayBuffer[T])
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val list = buffer.getSeq[T](0)
if (!input.isNullAt(0)) {
  val sales = input.getAs[T](0)
  buffer.update(0, list:+sales)
}
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1.update(0, buffer1.getSeq[T](0) ++ buffer2.getSeq[T](0))
  }

  def evaluate(buffer: Row): Any = {
buffer.getSeq[T](0)
  }
}

All you would need to do is modify it to contain only the top 100…
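
For the top-100 variant, a hedged sketch of the only parts that change, assuming
an Ordering on the element type; the class name is made up and the imports are
the same as for the class above.

// Sketch: a bounded variant of the aggregator above. The Ordering is taken as a
// constructor argument so update/merge keep their UDAF signatures.
class CollectTopNFunction[T](colType: DataType, n: Int)(implicit ord: Ordering[T])
    extends CollectListFunction[T](colType) {

  // Keep only the n largest values seen so far.
  private def topN(values: Seq[T]): Seq[T] =
    values.sorted(ord.reverse).take(n)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0))
      buffer.update(0, topN(buffer.getSeq[T](0) :+ input.getAs[T](0)))
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, topN(buffer1.getSeq[T](0) ++ buffer2.getSeq[T](0)))
  }
}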

From: burtonator2...@gmail.com 
[mailto:burtonator2...@gmail.com] On Behalf Of Kevin Burton
Sent: Sunday, September 11, 2016 6:33 AM
To: Karl Higley
Cc: user@spark.apache.org
Subject: Re: Selecting the top 100 records per group by?

Looks like you can do it with dense_rank functions.

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

I setup some basic records and seems like it did the right thing.

Now time to throw 50TB and 100 spark nodes at this problem and see what happens 
:)

On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton 
> wrote:
Ah.. might actually. I'll have to mess around with that.

On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley 
> wrote:
Would `topByKey` help?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42

Best,
Karl

On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton 
> wrote:
I'm trying to figure out a way to group by and return the top 100 records in 
that group.

Something like:

SELECT TOP(100, user_id) FROM posts GROUP BY user_id;

But I can't really figure out the best way to do this...

There is a FIRST and LAST aggregate function but this only returns one column.

I could do something like:

SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) LIMIT 100;

But that limit is applied for ALL the records. Not each individual user.

The only other thing I can think of is to do a manual map reduce and then have 
the reducer only return the top 100 each time...

Would LOVE some advice here...

--
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ 
profile




--
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ 
profile




--
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ 
profile



RE: add jars like spark-csv to ipython notebook with pyspakr

2016-09-11 Thread Mendelson, Assaf
In my case I do the following:
export PYSPARK_DRIVER_PYTHON_OPTS="notebook  --no-browser"
pyspark --jars myjar.jar --driver-class-path myjar.jar
hope this helps…
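
For spark-csv specifically, a hedged alternative is --packages (or the same flag
inside PYSPARK_SUBMIT_ARGS); the artifact version below is an assumption, so pick
the one matching your Scala build.

export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
pyspark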

From: pseudo oduesp [mailto:pseudo20...@gmail.com]
Sent: Friday, September 09, 2016 3:55 PM
To: user@spark.apache.org
Subject: add jars like spark-csv to ipython notebook with pyspakr

Hi,
How can I add a jar to an IPython notebook?
I tried PYSPARK_SUBMIT_ARGS without success.
Thanks


Re: Selecting the top 100 records per group by?

2016-09-11 Thread Mich Talebzadeh
DENSE_RANK will give you ordering and sequence within a particular column.
This is Hive

 var sqltext = """
 | SELECT RANK, timecreated,security, price
 |  FROM (
 |SELECT timecreated,security, price,
 |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
 |  FROM test.prices
 |   ) tmp
 |  WHERE rank <= 10
 | """
sql(sqltext).collect.foreach(println)

[1,2016-09-09 16:55:44,Esso,99.995]
[1,2016-09-09 21:22:52,AVIVA,99.995]
[1,2016-09-09 21:22:52,Barclays,99.995]
[1,2016-09-09 21:24:28,JPM,99.995]
[1,2016-09-09 21:30:38,Microsoft,99.995]
[1,2016-09-09 21:31:12,UNILEVER,99.995]
[2,2016-09-09 16:54:14,BP,99.99]
[2,2016-09-09 16:54:36,Tate & Lyle,99.99]
[2,2016-09-09 16:56:28,EASYJET,99.99]
[2,2016-09-09 16:59:28,IBM,99.99]
[2,2016-09-09 20:16:10,EXPERIAN,99.99]
[2,2016-09-09 22:25:20,Microsoft,99.99]
[2,2016-09-09 22:53:49,Tate & Lyle,99.99]
[3,2016-09-09 15:31:06,UNILEVER,99.985]

HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 September 2016 at 04:32, Kevin Burton  wrote:

> Looks like you can do it with dense_rank functions.
>
> https://databricks.com/blog/2015/07/15/introducing-window-
> functions-in-spark-sql.html
>
> I setup some basic records and seems like it did the right thing.
>
> Now time to throw 50TB and 100 spark nodes at this problem and see what
> happens :)
>
> On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton  wrote:
>
>> Ah.. might actually. I'll have to mess around with that.
>>
>> On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley  wrote:
>>
>>> Would `topByKey` help?
>>>
>>> https://github.com/apache/spark/blob/master/mllib/src/main/s
>>> cala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
>>>
>>> Best,
>>> Karl
>>>
>>> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton  wrote:
>>>
 I'm trying to figure out a way to group by and return the top 100
 records in that group.

 Something like:

 SELECT TOP(100, user_id) FROM posts GROUP BY user_id;

 But I can't really figure out the best way to do this...

 There is a FIRST and LAST aggregate function but this only returns one
 column.

 I could do something like:

 SELECT * FROM posts WHERE user_id IN ( /* select top users here */ )
 LIMIT 100;

 But that limit is applied for ALL the records. Not each individual
 user.

 The only other thing I can think of is to do a manual map reduce and
 then have the reducer only return the top 100 each time...

 Would LOVE some advice here...

 --

 We’re hiring if you know of any awesome Java Devops or Linux Operations
 Engineers!

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 


>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> 
>>
>>
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>