Spark Metrics: custom source/sink configurations not getting recognized

2016-09-05 Thread map reduced
Hi,

I've written my custom metrics source/sink for my Spark streaming app and I
am trying to initialize it from metrics.properties - but that doesn't work
from executors. I don't have control on the machines in Spark cluster, so I
can't copy properties file in $SPARK_HOME/conf/ in the cluster. I have it
in the fat jar where my app lives, but by the time my fat jar is downloaded
on worker nodes in cluster, executors are already started and their Metrics
system is already initialized - thus not picking my file with custom source
configuration in it.

Following this post, I've specified 'spark.files=metrics.properties' and
'spark.metrics.conf=metrics.properties', but by the time 'metrics.properties'
is shipped to executors, their metrics system is already initialized.
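
For completeness, this is roughly how I am setting those options programmatically (a
sketch; the app name and file paths are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("metrics-demo")                        // placeholder app name
  .set("spark.files", "metrics.properties")          // ship the file to the executors
  .set("spark.metrics.conf", "metrics.properties")   // point the metrics system at it
val ssc = new StreamingContext(conf, Seconds(10))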

If I initialize my own metrics system, it picks up my file, but then I'm
missing master/executor-level metrics/properties (e.g.
executor.sink.mySink.propName=myProp - can't read 'propName' from 'mySink'),
since those are initialized by Spark's metrics system.
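
For context, the relevant entries in the metrics.properties inside my fat jar look
roughly like this (the source/sink class names below are placeholders):

# custom source on all instances
*.source.mySource.class=com.example.metrics.MySource
# custom sink on executors, with one property
executor.sink.mySink.class=com.example.metrics.MySink
executor.sink.mySink.propName=myProp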

Is there a (programmatic) way to have 'metrics.properties' shipped before
executors initialize?

Here's my SO question.

Thanks,

KP


Re: Any estimate for a Spark 2.0.1 release date?

2016-09-05 Thread Takeshi Yamamuro
Hi,

Have you seen this?

// maropu

On Tue, Sep 6, 2016 at 7:42 AM, mhornbech  wrote:

> I can't find any JIRA issues with the tag that are unresolved. Apologies if
> this is a rookie mistake and the information is available elsewhere.
>
> Morten
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-estimate-for-a-Spark-2-0-1-release-date-tp27659.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
---
Takeshi Yamamuro


Re: Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-05 Thread Jeff Zhang
How did you upgrade to Spark 2.0?

On Mon, Sep 5, 2016 at 11:25 PM, Campagnola, Francesco <
francesco.campagn...@anritsu.com> wrote:

> Hi,
>
>
>
> in an already working Spark - Hive environment with Spark 1.6 and Hive
> 1.2.1, with Hive metastore configured on Postgres DB, I have upgraded Spark
> to the 2.0.0.
>
>
>
> I have started the thrift server on YARN, then tried to execute from the
> beeline cli or a jdbc client the following command:
>
> SHOW DATABASES;
>
> It always gives this error on Spark server side:
>
>
>
> spark@spark-test[spark] /home/spark> beeline -u
> jdbc:hive2://$(hostname):1 -n spark
>
>
>
> Connecting to jdbc:hive2://spark-test:1
>
> 16/09/05 17:41:43 INFO jdbc.Utils: Supplied authorities: spark-test:1
>
> 16/09/05 17:41:43 INFO jdbc.Utils: Resolved authority: spark-test:1
>
> 16/09/05 17:41:43 INFO jdbc.HiveConnection: Will try to open client
> transport with JDBC Uri: jdbc:hive2:// spark-test:1
>
> Connected to: Spark SQL (version 2.0.0)
>
> Driver: Hive JDBC (version 1.2.1.spark2)
>
> Transaction isolation: TRANSACTION_REPEATABLE_READ
>
> Beeline version 1.2.1.spark2 by Apache Hive
>
>
>
> 0: jdbc:hive2:// spark-test:1> show databases;
>
> java.lang.IllegalStateException: Can't overwrite cause with
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow
> cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>
> at java.lang.Throwable.initCause(Throwable.java:457)
>
> at org.apache.hive.service.cli.HiveSQLException.toStackTrace(
> HiveSQLException.java:236)
>
> at org.apache.hive.service.cli.HiveSQLException.toStackTrace(
> HiveSQLException.java:236)
>
> at org.apache.hive.service.cli.HiveSQLException.toCause(
> HiveSQLException.java:197)
>
> at org.apache.hive.service.cli.HiveSQLException.(
> HiveSQLException.java:108)
>
> at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>
> at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.
> java:242)
>
> at org.apache.hive.jdbc.HiveQueryResultSet.next(
> HiveQueryResultSet.java:365)
>
> at org.apache.hive.beeline.BufferedRows.(
> BufferedRows.java:42)
>
> at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>
> at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>
> at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(
> BeeLine.java:484)
>
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 0 in stage 3.0 failed 10 times, most recent failure: Lost
> task 0.9 in stage 3.0 (TID 12, vertica204): java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be
> cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> 4.apply(SparkPlan.scala:247)
>
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> 4.apply(SparkPlan.scala:240)
>
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$24.apply(RDD.scala:784)
>
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$24.apply(RDD.scala:784)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
> scala:70)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:274)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> Driver stacktrace:
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
>
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>
> at org.apache.hive.service.cli.HiveSQLException.newInstance(
> HiveSQLException.java:244)
>
> at org.apache.hive.service.cli.HiveSQLException.toStackTrace(
> HiveSQLException.java:210)
>
> ... 15 more
>
> Error: Error 

Re: Scala Vs Python

2016-09-05 Thread Luciano Resende
On Thu, Sep 1, 2016 at 3:15 PM, darren  wrote:

> This topic is a concern for us as well. In the data science world no one
> uses native scala or java by choice. It's R and Python. And python is
> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>
> This is why we have decoupled from spark in our project. It's really
> unfortunate the spark team has invested so heavily in Scala.
>
> As for speed, it comes from horizontal scaling and throughput. When you can
> scale outward, individual VM performance is less of an issue. Basic HPC
> principles.
>
>
You could still try to get the best of both worlds: have your data
scientists write their algorithms in Python and/or R, and let a
compiler/optimizer handle the optimizations needed to run them in a
distributed fashion on a Spark cluster, leveraging some of the low-level
APIs written in Java/Scala. Take a look at Apache SystemML
http://systemml.apache.org/ for more details.



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Any estimate for a Spark 2.0.1 release date?

2016-09-05 Thread mhornbech
I can't find any JIRA issues with the tag that are unresolved. Apologies if
this is a rookie mistake and the information is available elsewhere.

Morten



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Any-estimate-for-a-Spark-2-0-1-release-date-tp27659.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Splitting columns from a text file

2016-09-05 Thread Gourav Sengupta
Just use Spark CSV; all other ways of splitting and parsing are just
reinventing the wheel and a monumental waste of time.
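
For example, a minimal sketch with the spark-csv package (assuming Spark 1.x, the
package on the classpath, and no header row, so columns default to C0, C1, C2):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .load("/tmp/mytextfile.txt")

df.filter(df("C2") > 50.0).select("C2").show(5)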


Regards,
Gourav

On Mon, Sep 5, 2016 at 1:48 PM, Ashok Kumar 
wrote:

> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> :27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>textFile.map(x=>x.toString).split(",")
>
> However, the above throws error?
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>


Re: Scala Vs Python

2016-09-05 Thread Gourav Sengupta
The pertinent question is between "functional programming" and procedural or
OO programming.

I think when you are dealing with data solutions, functional programming is
a more natural way to think and work.


Regards,
Gourav

On Sun, Sep 4, 2016 at 11:17 AM, AssafMendelson 
wrote:

> I don’t have anything off the hand (Unfortunately I didn’t really save it)
> but you can easily make some toy examples.
>
> For example you might do something like defining a simple UDF (e.g. test
> if number < 10)
>
> Then create the function in scala:
>
>
>
> package com.example
>
> import org.apache.spark.sql.functions.udf
>
>
>
> object udfObj extends Serializable {
>
>   def createUDF = {
>
> udf((x: Int) => x < 10)
>
>   }
>
> }
>
>
>
> Compile the scala and run pyspark with --jars --driver-class-path on the
> created jar.
>
> Inside pyspark do something like:
>
>
>
> from py4j.java_gateway import java_import
>
> from pyspark.sql.column import Column
>
> from pyspark.sql.functions import udf
>
> from pyspark.sql.types import BooleanType
>
> import time
>
>
>
> jvm = sc._gateway.jvm
>
> java_import(jvm, "com.example")
>
> def udf_scala(col):
>
> return Column(jvm.com.example.udfObj.createUDF().apply(col))
>
>
>
> udf_python = udf(lambda x: x<10, BooleanType())
>
>
>
> df = spark.range(1000)
>
> df.cache()
>
> df.count()
>
>
>
> df1 = df.filter(df.id < 10)
>
> df2 = df.filter(udf_scala(df.id))
>
> df3 = df.filter(udf_python(df.id))
>
>
>
> t1 = time.time()
>
> df1.count()
>
> t2 = time.time()
>
> df2.count()
>
> t3 = time.time()
>
> df3.count()
>
> t4 = time.time()
>
>
>
> print "time for builtin " + str(t2-t1)
>
> print "time for scala " + str(t3-t2)
>
> print "time for python "  + str(t4-t3)
>
>
>
>
>
>
>
> The differences between the times should give you how long it takes (note
> the caching is done in order to make sure we don’t have issues where the
> range is created once and then reused) .
>
> BTW, I saw this can be very touchy in terms of the cluster and its
> configuration. I ran it on two different cluster configurations and ran it
> several times to get some idea on the noise.
>
> Of course, the more complicated the UDF, the less the overhead affects you.
>
> Hope this helps.
>
> Assaf
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* ayan guha [mailto:[hidden email]
> ]
> *Sent:* Sunday, September 04, 2016 11:00 AM
> *To:* Mendelson, Assaf
> *Cc:* user
> *Subject:* Re: Scala Vs Python
>
>
>
> Hi
>
>
>
> This one is quite interesting. Is it possible to share few toy examples?
>
>
>
> On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson <[hidden email]
> > wrote:
>
> I am not aware of any official testing but you can easily create your own.
>
> In testing I made I saw that python UDF were more than 10 times slower
> than scala UDF (and in some cases it was closer to 50 times slower).
>
> That said, it would depend on how you use your UDF.
>
> For example, lets say you have a 1 billion row table which you do some
> aggregation on and left with a 10K rows table. If you do the python UDF in
> the beginning then it might have a hard hit but if you do it on the 10K
> rows table then the overhead might be negligible.
>
> Furthermore, you can always write the UDF in scala and wrap it.
>
> This is something my team did. We have data scientists working on spark in
> python. Normally, they can use the existing functions to do what they need
> (Spark already has a pretty nice spread of functions which answer most of
> the common use cases). When they need a new UDF or UDAF they simply ask my
> team (which does the engineering) and we write them a scala one and then
> wrap it to be accessible from python.
>
>
>
>
>
> *From:* ayan guha [mailto:[hidden email]
> ]
> *Sent:* Friday, September 02, 2016 12:21 AM
> *To:* kant kodali
> *Cc:* Mendelson, Assaf; user
> *Subject:* Re: Scala Vs Python
>
>
>
> Thanks All for your replies.
>
>
>
> Feature Parity:
>
>
>
> MLLib, RDD and dataframes features are totally comparable. Streaming is
> now at par in functionality too, I believe. However, what really worries me
> is not having Dataset APIs at all in Python. I think that's a deal breaker.
>
>
>
> Performance:
>
> I do get this bit when RDDs are involved, but not when DataFrame is the
> only construct I am operating on. DataFrames are supposed to be
> language-agnostic in terms of performance. So why do people think Python is
> slower? Is it because of using UDFs? Any other reason?
>
>
>
> *Is there any kind of benchmarking/stats around Python UDF vs Scala UDF
> comparison? like the one out there  b/w RDDs.*
>
>
>
> @Kant:  I am not comparing ANY applications. I am comparing SPARK
> applications only. I would be glad to hear your opinion on why pyspark
> applications will not work, if you have any benchmarks please share if
> possible.
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Sep 

Spark ML 2.1.0 new features

2016-09-05 Thread janardhan shetty
Is there any documentation, or are there links, on the new features we can
expect in the Spark ML 2.1.0 release?


Cassandra timestamp to spark Date field

2016-09-05 Thread Selvam Raman
Hi All,

As per the DataStax documentation, the Cassandra-to-Spark type mapping for
timestamp is: Long, java.util.Date, java.sql.Date, org.joda.time.DateTime.

Please help me with your input.

I have a Cassandra table with 30 fields. Out of it 3 are timestamp.

I read the Cassandra table using sc.cassandraTable
[com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow]
= CassandraTableScanRDD[9] at RDD at CassandraRDD.scala:15]

Then I converted it to an RDD of Row:

val exis_repair_fact = sqlContext.createDataFrame(rddrepfact.map(r =>
  org.apache.spark.sql.Row.fromSeq(r.columnValues)), schema)

In the schema fields I have declared the timestamp column as:

StructField("shipped_datetime", DateType),


When I try to show the result, it throws an error saying java.util.Date
cannot be converted to java.sql.Date.

So how can I solve this issue?
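
For reference, the kind of conversion I believe is needed looks roughly like this (a
sketch, assuming the connector returns the column as java.util.Date):

val rows = rddrepfact.map { r =>
  val values = r.columnValues.map {
    case d: java.util.Date => new java.sql.Date(d.getTime)  // DateType expects java.sql.Date
    case other => other
  }
  org.apache.spark.sql.Row.fromSeq(values)
}
val exis_repair_fact = sqlContext.createDataFrame(rows, schema)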


First I have converted cassandrascanrdd to



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Spark SQL Tables on top of HBase Tables

2016-09-05 Thread Yan Zhou
There is an HSpark project, https://github.com/yzhou2001/HSpark, providing
native and fast access to HBase.
Currently it only supports Spark 1.4, but any suggestions and contributions
are more than welcome.

 Try it out to find its speedups!

On Sat, Sep 3, 2016 at 12:57 PM, Mich Talebzadeh 
wrote:

> Mine is Hbase-0.98,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 3 September 2016 at 20:51, Benjamin Kim  wrote:
>
>> I’m using Spark 1.6 and HBase 1.2. Have you got it to work using these
>> versions?
>>
>> On Sep 3, 2016, at 12:49 PM, Mich Talebzadeh 
>> wrote:
>>
>> I am trying to find a solution for this
>>
>> ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class
>> org.apache.hadoop.hive.hbase.HBaseSerDe not found
>>
>> I am using Spark 2 and Hive 2!
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 3 September 2016 at 20:31, Benjamin Kim  wrote:
>>
>>> Mich,
>>>
>>> I’m in the same boat. We can use Hive but not Spark.
>>>
>>> Cheers,
>>> Ben
>>>
>>> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> You can create Hive external  tables on top of existing Hbase table
>>> using the property
>>>
>>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>>
>>> Example
>>>
>>> hive> show create table hbase_table;
>>> OK
>>> CREATE TABLE `hbase_table`(
>>>   `key` int COMMENT '',
>>>   `value1` string COMMENT '',
>>>   `value2` int COMMENT '',
>>>   `value3` int COMMENT '')
>>> ROW FORMAT SERDE
>>>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
>>> STORED BY
>>>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>> WITH SERDEPROPERTIES (
>>>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>>>   'serialization.format'='1')
>>> TBLPROPERTIES (
>>>   'transient_lastDdlTime'='1472370939')
>>>
>>>  Then try to access this Hive table from Spark which is giving me grief
>>> at the moment :(
>>>
>>> scala> HiveContext.sql("use test")
>>> res9: org.apache.spark.sql.DataFrame = []
>>> scala> val hbase_table= spark.table("hbase_table")
>>> 16/09/02 23:31:07 ERROR log: error in initSerDe:
>>> java.lang.ClassNotFoundException Class 
>>> org.apache.hadoop.hive.hbase.HBaseSerDe
>>> not found
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 2 September 2016 at 23:08, KhajaAsmath Mohammed <
>>> mdkhajaasm...@gmail.com> wrote:
>>>
 Hi Kim,

 I am also looking for same information. Just got the same requirement
 today.

 Thanks,
 Asmath

 On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim 
 wrote:

> I was wondering if anyone has tried to create Spark SQL tables on top
> of HBase tables so that data in HBase can be accessed using Spark
> Thriftserver with SQL statements? This is similar what can be done using
> Hive.
>
> Thanks,
> Ben
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

>>>
>>>
>>
>>
>


Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
sc.textFile("filename").map(_.split(",")).filter(arr => arr.length == 3 &&
arr(2).toDouble > 50).collect this will give you a Array[Array[String]] do
as you may wish with it. And please read through abt RDD

On 5 Sep 2016 8:51 pm, "Ashok Kumar"  wrote:

> Thanks everyone.
>
> I am not skilled like you gentlemen
>
> This is what I did
>
> 1) Read the text file
>
> val textFile = sc.textFile("/tmp/myfile.txt")
>
> 2) That produces an RDD of String.
>
> 3) Create a DF after splitting the file into an Array
>
> val df = textFile.map(line => line.split(",")).map(x=>(x(0).
> toInt,x(1).toString,x(2).toDouble)).toDF
>
> 4) Create a class for column headers
>
>  case class Columns(col1: Int, col2: String, col3: Double)
>
> 5) Assign the column headers
>
> val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString,
> p(2).toString.toDouble))
>
> 6) Only interested in column 3 > 50
>
>  h.filter(col("Col3") > 50.0)
>
> 7) Now I just want Col3 only
>
> h.filter(col("Col3") > 50.0).select("col3").show(5)
> +-----------------+
> |             col3|
> +-----------------+
> |95.42536350467836|
> |61.56297588648554|
> |76.73982017179868|
> |68.86218120274728|
> |67.64613810115105|
> +-----------------+
> only showing top 5 rows
>
> Does that make sense. Are there shorter ways gurus? Can I just do all this
> on RDD without DF?
>
> Thanking you
>
>
>
>
>
>
>
> On Monday, 5 September 2016, 15:19, ayan guha  wrote:
>
>
> Then, You need to refer third term in the array, convert it to your
> desired data type and then use filter.
>
>
> On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar  wrote:
>
> Hi,
> I want to filter them for values.
>
> This is what is in array
>
> 74,20160905-133143,98. 11218069128827594148
>
> I want to filter anything > 50.0 in the third column
>
> Thanks
>
>
>
>
> On Monday, 5 September 2016, 15:07, ayan guha  wrote:
>
>
> Hi
>
> x.split returns an array. So, after first map, you will get RDD of arrays.
> What is your expected outcome of 2nd map?
>
> On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar  > wrote:
>
> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[ Array[String]] = MapPartitionsRDD[27] at
> map at :27
>
> How can I work on individual columns. I understand they are strings
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>  | )
> :27: error: value getString is not a member of Array[String]
>textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar  tigeranalytics.com > wrote:
>
>
> Basic error, you get back an RDD on transformations like map.
> sc.textFile("filename").map(x => x.split(",")
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98. 11218069128827594148
> 75,20160905-133143,49. 52776998815916807742
> 76,20160905-133143,56. 08029957123980984556
> 77,20160905-133143,46. 63689526544407522777
> 78,20160905-133143,84. 88227141164402181551
> 79,20160905-133143,68. 72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile. txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString). split(",")
> :27: error: value split is not a member of
> org.apache.spark.rdd.RDD[ String]
>textFile.map(x=>x.toString). split(",")
>
> However, the above throws error?
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: Problem in accessing swebhdfs

2016-09-05 Thread Steve Loughran
Looks like it got a 404 back with a text/plain response, tried to parse that as 
JSON and made a mess of things. Updated the relevant (still open) JIRA with 
your stack trace.

https://issues.apache.org/jira/browse/HDFS-6220


At a guess, the file it is looking for isn't there. Possible causes:

- the root path for the input/glob pattern isn't there
- the path has a character (e.g. ":") that webhdfs can't handle

Check the path settings and the URL itself. Interesting about the text/plain, 
that makes me wonder if there's a proxy getting involved.

Not much else that can be done here, maybe look at the logs.

-Steve

On 4 Sep 2016, at 23:25, Sourav Mazumder 
> wrote:

Hi,

When I try to access a swebhdfs uri I get following error.

In my hadoop cluster webhdfs is enabled.

Also I can access the same resource using webhdfs API from a http client with 
SSL.

Any idea what is going wring ?

Regards,
Sourav

java.io.IOException: Unexpected HTTP response: code=404 != 200, 
op=GETFILESTATUS, message=Not Found
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:347)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:90)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:613)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:463)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:492)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:488)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:848)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:858)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1674)
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
... 50 elided
Caused by: java.io.IOException: Content-Type "text/html;charset=ISO-8859-1" is 
incompatible with "application/json" (parsed="text/html;charset=ISO-8859-1")
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.jsonParse(WebHdfsFileSystem.java:320)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:343)
... 78 more




Re: S3A + EMR failure when writing Parquet?

2016-09-05 Thread Steve Loughran

On 4 Sep 2016, at 18:05, Everett Anderson 
> wrote:

My impression from reading your various other replies on S3A is that it's also 
best to use mapreduce.fileoutputcommitter.algorithm.version=2 (which might 
someday be the default) 
and,

for now yes; there's work under way by various people to implement consistency 
and cache performance: S3guard 
https://issues.apache.org/jira/browse/HADOOP-13345  . That'll need to come with 
a new commit algorithm which works with it and other object stores with similar 
semantics (Azure WASB). I want an O(1) commit there with a very small (1).
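
For reference, that committer setting is usually passed through to a Spark job with
the standard spark.hadoop.* passthrough, roughly like this (a sketch; the app name is
a placeholder):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("s3a-output-demo")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()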

presumably if your data fits well in memory, use fs.s3a.fast.upload=true. Is 
that right?


as of last week: no.

Having written a test to upload multi-GB files generated at the speed of memory 
copies, I don't think that holds at scale. If you are generating data faster than 
it can be uploaded, you will OOM.


Small datasets running in-EC2 on large instances, or installations where you 
have a local object store supporting S3 API, you should get away with it. Bulk 
uploads over long-haul networks: no.

Keep an eye on : https://issues.apache.org/jira/browse/HADOOP-13560




Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Mich Talebzadeh
I suppose we are all looking at some sort of use case for a recommendation
engine, whether that is for commodities or some Financial Instrument.

They all depend on some form of criteria. The commodity or instrument does
not matter. It could be a new "Super Mario Wii" release or some shares that
are "performing well".

So if you are a big retail shop you would be ordering hundreds of the new
game, or if you are a portfolio manager you would be seeing a good "buy"
there.

So I guess the design would be pretty similar. There will still be a batch
layer and a speed layer, and some presentation layer hooked to a faster
database like Druid or HBase.

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 September 2016 at 16:28, Mich Talebzadeh 
wrote:

> no problem. Got it Alonso
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 5 September 2016 at 15:59, Alonso Isidoro Roman 
> wrote:
>
>> i would like to work, not love :) sorry
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> 
>>
>> 2016-09-05 16:58 GMT+02:00 Alonso Isidoro Roman :
>>
>>> By the way, i would love to work in your project, looks promising!
>>>
>>>
>>>
>>> Alonso Isidoro Roman
>>> [image: https://]about.me/alonso.isidoro.roman
>>>
>>> 
>>>
>>> 2016-09-05 16:57 GMT+02:00 Alonso Isidoro Roman :
>>>
 Hi Mitch,


1. What do you mean "my own rating" here? You know the products. So
what Amazon is going to provide by way of Kafka?

 The idea was to embed the functionality of a kafka producer within a
 rest service in order i can invoke this logic with my a rating. I did not
 create such functionality because i started to make another things, i get
 bored, basically. I created some unix commands with this code, using
 sbt-pack.



1. Assuming that you have created topic specifically for this
purpose then that topic is streamed into Kafka, some algorithms is 
 applied
and results are saved in DB

 you got it!

1. You have some dashboard that will fetch data (via ???) from the
DB and I guess Zeppelin can do it here?


 No, not any dashboard yet. Maybe it is not a good idea to connect
 mongodb with a dashboard through a web socket. It probably works, for a
 proof of concept, but, in a real project? i don't know yet...

 You can see what i did to push data within a kafka topic in this scala
 class
 ,
 you have to invoke pack within the scala shell to create this unix command.

 Regards!

 Alonso



 Alonso Isidoro Roman
 [image: https://]about.me/alonso.isidoro.roman

 

 2016-09-05 16:39 GMT+02:00 Mich Talebzadeh :

> Thank you Alonso,
>
> I looked at your project. Interesting
>
> As I see it this is what you are suggesting
>
>
>1. A kafka producer is going to ask periodically to Amazon in
>order to know what products based on my own ratings and i am going to
>introduced them into some kafka topic.
>2. A spark streaming process is going to read from that previous
>topic.
>3. Apply some machine learning algorithms (ALS, content based
>filtering colaborative filtering) on those 

Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-05 Thread Campagnola, Francesco
Hi,

in an already working Spark - Hive environment with Spark 1.6 and Hive 1.2.1, 
with Hive metastore configured on Postgres DB, I have upgraded Spark to the 
2.0.0.

I have started the thrift server on YARN, then tried to execute from the 
beeline cli or a jdbc client the following command:
SHOW DATABASES;
It always gives this error on Spark server side:

spark@spark-test[spark] /home/spark> beeline -u jdbc:hive2://$(hostname):1 
-n spark

Connecting to jdbc:hive2://spark-test:1
16/09/05 17:41:43 INFO jdbc.Utils: Supplied authorities: spark-test:1
16/09/05 17:41:43 INFO jdbc.Utils: Resolved authority: spark-test:1
16/09/05 17:41:43 INFO jdbc.HiveConnection: Will try to open client transport 
with JDBC Uri: jdbc:hive2:// spark-test:1
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive

0: jdbc:hive2:// spark-test:1> show databases;
java.lang.IllegalStateException: Can't overwrite cause with 
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to 
org.apache.spark.sql.catalyst.expressions.UnsafeRow
at java.lang.Throwable.initCause(Throwable.java:457)
at 
org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
at 
org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
at 
org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
at 
org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108)
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
at 
org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42)
at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
at org.apache.hive.beeline.Commands.execute(Commands.java:860)
at org.apache.hive.beeline.Commands.sql(Commands.java:713)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
at 
org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 3.0 failed 10 times, most recent failure: Lost task 0.9 in 
stage 3.0 (TID 12, vertica204): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to 
org.apache.spark.sql.catalyst.expressions.UnsafeRow
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244)
at 
org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:210)
... 15 more
Error: Error retrieving next row (state=,code=0)

The same command works when using Spark 1.6. Could this be an issue with Spark 2.0.0?

Thanks!


Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Thanks everyone.
I am not skilled like you gentlemen
This is what I did
1) Read the text file
val textFile = sc.textFile("/tmp/myfile.txt")

2) That produces an RDD of String.
3) Create a DF after splitting the file into an Array 
val df = textFile.map(line => 
line.split(",")).map(x=>(x(0).toInt,x(1).toString,x(2).toDouble)).toDF
4) Create a class for column headers
 case class Columns(col1: Int, col2: String, col3: Double)
5) Assign the column headers 
val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, 
p(2).toString.toDouble))
6) Only interested in column 3 > 50
 h.filter(col("Col3") > 50.0)
7) Now I just want Col3 only
h.filter(col("Col3") > 50.0).select("col3").show(5)+-+|         
    
col3|+-+|95.42536350467836||61.56297588648554||76.73982017179868||68.86218120274728||67.64613810115105|+-+only
 showing top 5 rows
Does that make sense. Are there shorter ways gurus? Can I just do all this on 
RDD without DF?
Thanking you




 

On Monday, 5 September 2016, 15:19, ayan guha  wrote:
 

 Then, You need to refer third term in the array, convert it to your desired 
data type and then use filter. 

On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar  wrote:

Hi,
I want to filter them for values.
This is what is in array
74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column
Thanks

 

On Monday, 5 September 2016, 15:07, ayan guha  wrote:
 

 Hi
x.split returns an array. So, after first map, you will get RDD of arrays. What 
is your expected outcome of 2nd map? 
On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar  
wrote:

Thank you sir.
This is what I get
scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27

How can I work on individual columns. I understand they are strings

scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))
regards

 

On Monday, 5 September 2016, 13:51, Somasundaram Sekar  wrote:
 

Basic error, you get back an RDD on transformations like map.
sc.textFile("filename").map(x => x.split(",")
On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

Hi,
I have a text file as below that I read in
74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt")

Now I want to split the rows separated by ","

scala> textFile.map(x=>x.toString).split(",")
:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")
However, the above throws error?
Any ideas what is wrong or how I can do this if I can avoid converting it to 
String?
Thanking



   



-- 
Best Regards,
Ayan Guha


   



-- 
Best Regards,
Ayan Guha


   

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
By the way, i would love to work in your project, looks promising!



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman


2016-09-05 16:57 GMT+02:00 Alonso Isidoro Roman :

> Hi Mitch,
>
>
>1. What do you mean "my own rating" here? You know the products. So
>what Amazon is going to provide by way of Kafka?
>
> The idea was to embed the functionality of a kafka producer within a rest
> service in order i can invoke this logic with my a rating. I did not create
> such functionality because i started to make another things, i get bored,
> basically. I created some unix commands with this code, using sbt-pack.
>
>
>
>1. Assuming that you have created topic specifically for this purpose
>then that topic is streamed into Kafka, some algorithms is applied and
>results are saved in DB
>
> you got it!
>
>1. You have some dashboard that will fetch data (via ???) from the DB
>and I guess Zeppelin can do it here?
>
>
> No, not any dashboard yet. Maybe it is not a good idea to connect mongodb
> with a dashboard through a web socket. It probably works, for a proof of
> concept, but, in a real project? i don't know yet...
>
> You can see what i did to push data within a kafka topic in this scala
> class
> ,
> you have to invoke pack within the scala shell to create this unix command.
>
> Regards!
>
> Alonso
>
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2016-09-05 16:39 GMT+02:00 Mich Talebzadeh :
>
>> Thank you Alonso,
>>
>> I looked at your project. Interesting
>>
>> As I see it this is what you are suggesting
>>
>>
>>1. A kafka producer is going to ask periodically to Amazon in order
>>to know what products based on my own ratings and i am going to introduced
>>them into some kafka topic.
>>2. A spark streaming process is going to read from that previous
>>topic.
>>3. Apply some machine learning algorithms (ALS, content based
>>filtering colaborative filtering) on those datasets readed by the spark
>>streaming process.
>>4. Save results in a mongo or cassandra instance.
>>5. Use play framework to create an websocket interface between the
>>mongo instance and the visual interface.
>>
>>
>> As I understand
>>
>> Point 1: A kafka producer is going to ask periodically to Amazon in order
>> to know what products based on my own ratings .
>>
>>
>>1. What do you mean "my own rating" here? You know the products. So
>>what Amazon is going to provide by way of Kafka?
>>2. Assuming that you have created topic specifically for this purpose
>>then that topic is streamed into Kafka, some algorithms is applied and
>>results are saved in DB
>>3. You have some dashboard that will fetch data (via ???) from the DB
>>and I guess Zeppelin can do it here?
>>
>>
>> Do you Have a DFD diagram for your design in case. Something like below
>> (hope does not look pedantic, non intended).
>>
>> [image: Inline images 1]
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 5 September 2016 at 15:08, Alonso Isidoro Roman 
>> wrote:
>>
>>> Hi Mitch, i wrote few months ago a tiny project with this issue in mind.
>>> The idea is to apply ALS algorithm in order to get some valid
>>> recommendations from another users.
>>>
>>>
>>> The url of the project
>>> 
>>>
>>>
>>>
>>> Alonso Isidoro Roman
>>> [image: https://]about.me/alonso.isidoro.roman
>>>
>>> 
>>>
>>> 2016-09-05 15:41 GMT+02:00 Mich Talebzadeh :
>>>
 Hi,

 Has anyone done any work on Real time recommendation engines with Spark
 and Scala.

 I have seen few PPTs with Python but wanted to see if these have been
 done with Scala.

 I trust this question makes sense.

 Thanks

 p.s. My prime 

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
Hi Mitch,


   1. What do you mean "my own rating" here? You know the products. So what
   Amazon is going to provide by way of Kafka?

The idea was to embed the functionality of a Kafka producer within a REST
service so that I can invoke this logic with my own rating. I did not create
that functionality because I started on other things and, basically, got
bored. I created some unix commands with this code, using sbt-pack.



   1. Assuming that you have created topic specifically for this purpose
   then that topic is streamed into Kafka, some algorithms is applied and
   results are saved in DB

you got it!

   1. You have some dashboard that will fetch data (via ???) from the DB
   and I guess Zeppelin can do it here?


No, no dashboard yet. Maybe it is not a good idea to connect MongoDB to a
dashboard through a web socket. It probably works for a proof of concept,
but in a real project? I don't know yet...

You can see what I did to push data into a Kafka topic in this Scala class;
you have to invoke pack within the scala shell to create this unix command.
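
On the consuming side, the streaming job reads that topic roughly like this (a
sketch, assuming an existing StreamingContext ssc and placeholder broker/topic names):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")   // placeholder broker
val ratings = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("amazonRatingsTopic"))                      // placeholder topic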

Regards!

Alonso



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman


2016-09-05 16:39 GMT+02:00 Mich Talebzadeh :

> Thank you Alonso,
>
> I looked at your project. Interesting
>
> As I see it this is what you are suggesting
>
>
>1. A kafka producer is going to ask periodically to Amazon in order to
>know what products based on my own ratings and i am going to introduced
>them into some kafka topic.
>2. A spark streaming process is going to read from that previous topic.
>3. Apply some machine learning algorithms (ALS, content based
>filtering colaborative filtering) on those datasets readed by the spark
>streaming process.
>4. Save results in a mongo or cassandra instance.
>5. Use play framework to create an websocket interface between the
>mongo instance and the visual interface.
>
>
> As I understand
>
> Point 1: A kafka producer is going to ask periodically to Amazon in order
> to know what products based on my own ratings .
>
>
>1. What do you mean "my own rating" here? You know the products. So
>what Amazon is going to provide by way of Kafka?
>2. Assuming that you have created topic specifically for this purpose
>then that topic is streamed into Kafka, some algorithms is applied and
>results are saved in DB
>3. You have some dashboard that will fetch data (via ???) from the DB
>and I guess Zeppelin can do it here?
>
>
> Do you Have a DFD diagram for your design in case. Something like below
> (hope does not look pedantic, non intended).
>
> [image: Inline images 1]
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 5 September 2016 at 15:08, Alonso Isidoro Roman 
> wrote:
>
>> Hi Mitch, i wrote few months ago a tiny project with this issue in mind.
>> The idea is to apply ALS algorithm in order to get some valid
>> recommendations from another users.
>>
>>
>> The url of the project
>> 
>>
>>
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> 
>>
>> 2016-09-05 15:41 GMT+02:00 Mich Talebzadeh :
>>
>>> Hi,
>>>
>>> Has anyone done any work on Real time recommendation engines with Spark
>>> and Scala.
>>>
>>> I have seen few PPTs with Python but wanted to see if these have been
>>> done with Scala.
>>>
>>> I trust this question makes sense.
>>>
>>> Thanks
>>>
>>> p.s. My prime interest would be in Financial markets.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from 

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Mich Talebzadeh
Thank you Alonso,

I looked at your project. Interesting

As I see it this is what you are suggesting


   1. A kafka producer is going to ask Amazon periodically in order to know
   what products to recommend based on my own ratings, and I am going to
   introduce them into some kafka topic.
   2. A spark streaming process is going to read from that previous topic.
   3. Apply some machine learning algorithms (ALS, content-based filtering,
   collaborative filtering) on those datasets read by the spark streaming
   process.
   4. Save results in a mongo or cassandra instance.
   5. Use the play framework to create a websocket interface between the
   mongo instance and the visual interface.


As I understand

Point 1: A kafka producer is going to ask periodically to Amazon in order
to know what products based on my own ratings .


   1. What do you mean by "my own rating" here? You know the products. So
   what is Amazon going to provide by way of Kafka?
   2. Assuming that you have created a topic specifically for this purpose,
   that topic is streamed into Kafka, some algorithm is applied and the
   results are saved in a DB
   3. You have some dashboard that will fetch data (via ???) from the DB,
   and I guess Zeppelin can do it here?


Do you have a DFD diagram for your design, by any chance? Something like the
one below (hope it does not look pedantic; none intended).

[image: Inline images 1]






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 September 2016 at 15:08, Alonso Isidoro Roman 
wrote:

> Hi Mitch, i wrote few months ago a tiny project with this issue in mind.
> The idea is to apply ALS algorithm in order to get some valid
> recommendations from another users.
>
>
> The url of the project
> 
>
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2016-09-05 15:41 GMT+02:00 Mich Talebzadeh :
>
>> Hi,
>>
>> Has anyone done any work on Real time recommendation engines with Spark
>> and Scala.
>>
>> I have seen few PPTs with Python but wanted to see if these have been
>> done with Scala.
>>
>> I trust this question makes sense.
>>
>> Thanks
>>
>> p.s. My prime interest would be in Financial markets.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Splitting columns from a text file

2016-09-05 Thread ayan guha
Then, you need to refer to the third term in the array, convert it to your
desired data type, and then use filter.
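
That is, something along these lines (a sketch, keeping your variable names):

textFile.map(_.split(",")).filter(arr => arr(2).toDouble > 50.0)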


On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar  wrote:

> Hi,
> I want to filter them for values.
>
> This is what is in array
>
> 74,20160905-133143,98.11218069128827594148
>
> I want to filter anything > 50.0 in the third column
>
> Thanks
>
>
>
>
> On Monday, 5 September 2016, 15:07, ayan guha  wrote:
>
>
> Hi
>
> x.split returns an array. So, after first map, you will get RDD of arrays.
> What is your expected outcome of 2nd map?
>
> On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar  > wrote:
>
> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[ Array[String]] = MapPartitionsRDD[27] at
> map at :27
>
> How can I work on individual columns. I understand they are strings
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>  | )
> :27: error: value getString is not a member of Array[String]
>textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar  tigeranalytics.com > wrote:
>
>
> Basic error, you get back an RDD on transformations like map.
> sc.textFile("filename").map(x => x.split(",")
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98. 11218069128827594148
> 75,20160905-133143,49. 52776998815916807742
> 76,20160905-133143,56. 08029957123980984556
> 77,20160905-133143,46. 63689526544407522777
> 78,20160905-133143,84. 88227141164402181551
> 79,20160905-133143,68. 72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile. txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString). split(",")
> :27: error: value split is not a member of
> org.apache.spark.rdd.RDD[ String]
>textFile.map(x=>x.toString). split(",")
>
> However, the above throws error?
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


-- 
Best Regards,
Ayan Guha


Re: Splitting columns from a text file

2016-09-05 Thread Fridtjof Sander

Ask yourself how to access the third element in an array in Scala.


Am 05.09.2016 um 16:14 schrieb Ashok Kumar:

Hi,
I want to filter them for values.

This is what is in array

74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column

Thanks




On Monday, 5 September 2016, 15:07, ayan guha  wrote:


Hi

x.split returns an array. So, after first map, you will get RDD of 
arrays. What is your expected outcome of 2nd map?


On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar 
> 
wrote:


Thank you sir.

This is what I get

scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[ Array[String]] =
MapPartitionsRDD[27] at map at :27

How can I work on individual columns. I understand they are strings

scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
 | )
:27: error: value getString is not a member of Array[String]
 textFile.map(x=> x.split(",")).map(x => (x.getString(0))

regards




On Monday, 5 September 2016, 13:51, Somasundaram Sekar
> wrote:


Basic error, you get back an RDD on transformations like map.
sc.textFile("filename").map(x => x.split(",")

On 5 Sep 2016 6:19 pm, "Ashok Kumar"
 wrote:

Hi,

I have a text file as below that I read in

74,20160905-133143,98. 11218069128827594148
75,20160905-133143,49. 52776998815916807742
76,20160905-133143,56. 08029957123980984556
77,20160905-133143,46. 63689526544407522777
78,20160905-133143,84. 88227141164402181551
79,20160905-133143,68. 72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile. txt")

Now I want to split the rows separated by ","

scala> textFile.map(x=>x.toString). split(",")
:27: error: value split is not a member of
org.apache.spark.rdd.RDD[ String]
 textFile.map(x=>x.toString). split(",")

However, the above throws error?

Any ideas what is wrong or how I can do this if I can avoid
converting it to String?

Thanking






--
Best Regards,
Ayan Guha






Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Hi,
I want to filter them for values.
This is what is in array
74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column
Thanks

 

On Monday, 5 September 2016, 15:07, ayan guha  wrote:
 

 Hi
x.split returns an array. So, after first map, you will get RDD of arrays. What 
is your expected outcome of 2nd map? 
On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar  
wrote:

Thank you sir.
This is what I get
scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27

How can I work on individual columns. I understand they are strings

scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))
regards

 

On Monday, 5 September 2016, 13:51, Somasundaram Sekar  wrote:
 

Basic error, you get back an RDD on transformations like map.
sc.textFile("filename").map(x => x.split(",")
On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

Hi,
I have a text file as below that I read in
74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000
val textFile = sc.textFile("/tmp/mytextfile.txt")
Now I want to split the rows separated by ","
scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")
However, the above throws error?
Any ideas what is wrong or how I can do this if I can avoid converting it to 
String?
Thanking



   



-- 
Best Regards,
Ayan Guha


   

Re: Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Alonso Isidoro Roman
Hi Mitch, I wrote a tiny project a few months ago with this issue in mind.
The idea is to apply the ALS algorithm in order to get some valid
recommendations from other users.
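The core of that approach is MLlib's ALS; a rough sketch (the ratings path and the hyper-parameters below are only placeholders):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical input: "userId,productId,rating" lines.
val ratings = sc.textFile("/tmp/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(",")
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// rank = 10, iterations = 10, lambda = 0.01 are illustrative values only.
val model = ALS.train(ratings, 10, 10, 0.01)

// Top 5 product recommendations for user 42.
model.recommendProducts(42, 5).foreach(println)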


The url of the project




Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman


2016-09-05 15:41 GMT+02:00 Mich Talebzadeh :

> Hi,
>
> Has anyone done any work on Real time recommendation engines with Spark
> and Scala.
>
> I have seen few PPTs with Python but wanted to see if these have been done
> with Scala.
>
> I trust this question makes sense.
>
> Thanks
>
> p.s. My prime interest would be in Financial markets.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Splitting columns from a text file

2016-09-05 Thread ayan guha
Hi

x.split returns an array. So, after first map, you will get RDD of arrays.
What is your expected outcome of 2nd map?
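If the goal is typed columns, the second map can index into that array, e.g. (a sketch using the three-column layout from the question):

val rows = sc.textFile("/tmp/mytextfile.txt")
  .map(_.split(","))                                    // RDD[Array[String]]
  .map(arr => (arr(0).toInt, arr(1), arr(2).toDouble))  // RDD[(Int, String, Double)]
rows.take(3).foreach(println)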

On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar 
wrote:

> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at :27
>
> How can I work on individual columns. I understand they are strings
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>  | )
> :27: error: value getString is not a member of Array[String]
>textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar  tigeranalytics.com> wrote:
>
>
> Basic error, you get back an RDD on transformations like map.
> sc.textFile("filename").map(x => x.split(",")
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98. 11218069128827594148
> 75,20160905-133143,49. 52776998815916807742
> 76,20160905-133143,56. 08029957123980984556
> 77,20160905-133143,46. 63689526544407522777
> 78,20160905-133143,84. 88227141164402181551
> 79,20160905-133143,68. 72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile. txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString). split(",")
> :27: error: value split is not a member of
> org.apache.spark.rdd.RDD[ String]
>textFile.map(x=>x.toString). split(",")
>
> However, the above throws error?
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>
>
>


-- 
Best Regards,
Ayan Guha


Real Time Recommendation Engines with Spark and Scala

2016-09-05 Thread Mich Talebzadeh
Hi,

Has anyone done any work on real-time recommendation engines with Spark and
Scala?

I have seen a few PPTs with Python but wanted to see if these have been done
with Scala.

I trust this question makes sense.

Thanks

p.s. My prime interest would be in Financial markets.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
Please have a look at the documentation for information on how to work with
RDD. Start with this http://spark.apache.org/docs/latest/quick-start.html
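For example, the basic pattern from the quick start, applied to the file in this thread (run in spark-shell):

val textFile = sc.textFile("/tmp/mytextfile.txt")
textFile.count()                    // number of lines
textFile.first()                    // first line as a String
textFile.map(_.split(",")).first()  // first line split into an Array[String]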

On 5 Sep 2016 7:00 pm, "Ashok Kumar"  wrote:

> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at :27
>
> How can I work on individual columns. I understand they are strings
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>  | )
> :27: error: value getString is not a member of Array[String]
>textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar  tigeranalytics.com> wrote:
>
>
> Basic error, you get back an RDD on transformations like map.
> sc.textFile("filename").map(x => x.split(",")
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98. 11218069128827594148
> 75,20160905-133143,49. 52776998815916807742
> 76,20160905-133143,56. 08029957123980984556
> 77,20160905-133143,46. 63689526544407522777
> 78,20160905-133143,84. 88227141164402181551
> 79,20160905-133143,68. 72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile. txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString). split(",")
> :27: error: value split is not a member of
> org.apache.spark.rdd.RDD[ String]
>textFile.map(x=>x.toString). split(",")
>
> However, the above throws error?
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>
>
>


Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Thank you sir.
This is what I get
scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at <console>:27
How can I work on individual columns. I understand they are strings
scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))
regards

 

On Monday, 5 September 2016, 13:51, Somasundaram Sekar 
 wrote:
 

 Basic error, you get back an RDD on transformations like map.
sc.textFile("filename").map(x => x.split(","))
On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

Hi,
I have a text file as below that I read in
74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000
val textFile = sc.textFile("/tmp/mytextfile.txt")
Now I want to split the rows separated by ","
scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")
However, the above throws error?
Any ideas what is wrong or how I can do this if I can avoid converting it to 
String?
Thanking



   

Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
Basic error, you get back an RDD on transformations like map.

sc.textFile("filename").map(x => x.split(","))

On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> :27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>textFile.map(x=>x.toString).split(",")
>
> However, the above throws error?
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>


Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Hi,
I have a text file as below that I read in
74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000
val textFile = sc.textFile("/tmp/mytextfile.txt")
Now I want to split the rows separated by ","
scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")
However, the above throws error?
Any ideas what is wrong or how I can do this if I can avoid converting it to 
String?
Thanking


SPARK ML- Feature Selection Techniques

2016-09-05 Thread Bahubali Jain
Hi,
Do we have any feature selection technique implementations (wrapper
methods, embedded methods) available in Spark ML?
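As far as the built-in ml package goes, the selectors it ships are filter-style (for example ChiSqSelector) rather than wrapper or embedded methods; a minimal sketch, assuming a DataFrame df with "features" (Vector) and "label" columns:

import org.apache.spark.ml.feature.ChiSqSelector

val selector = new ChiSqSelector()
  .setNumTopFeatures(20)            // keep the 20 highest-scoring features
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selected = selector.fit(df).transform(df)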

Thanks,
Baahu
-- 
Twitter:http://twitter.com/Baahu


Re: Why there is no top method in dataset api

2016-09-05 Thread Sean Owen
No, I'm not advising you to use .rdd, just saying it is possible.
Although I'd only use RDDs if you had a good reason to, given Datasets
now, they are not gone or even deprecated.

You do not need to order the whole data set to get the top element.
That isn't what top does though. You might be interested to look at
the source code. Nor is it what orderBy does if the optimizer is any good.

Computing .rdd doesn't materialize an RDD. It involves some non-zero
overhead in creating a plan, which should be minor compared to execution.
So would any computation of "top N" on a Dataset, so I don't think this is
relevant.


orderBy + take is already the way to accomplish "Dataset.top". It works on
Datasets, and therefore DataFrames too, for the reason you give. I'm not
sure what you're asking there.
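Concretely, the two routes look roughly like this (a sketch assuming a Dataset[Double] named ds):

import org.apache.spark.sql.functions.desc

// Drop to the RDD API: top(n) keeps a bounded priority queue per partition,
// so it does not sort the whole data set.
val viaRdd = ds.rdd.top(10)

// Stay in the Dataset API: for a sort followed by a small limit the optimizer
// typically plans a take-ordered style operator rather than a full sort.
val viaQuery = ds.orderBy(desc("value")).limit(10).collect()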


On Mon, Sep 5, 2016, 13:01 Jakub Dubovsky 
wrote:

> Thanks Sean,
>
> I was under impression that spark creators are trying to persuade user
> community not to use RDD api directly. Spark summit I attended was full of
> this. So I am a bit surprised that I hear use-rdd-api as an advice from
> you. But if this is a way then I have a second question. For conversion
> from dataset to rdd I would use Dataset.rdd lazy val. Since it is a lazy
> val it suggests there is some computation going on to create rdd as a copy.
> The question is how much computationally expansive is this conversion? If
> there is a significant overhead then it is clear why one would want to have
> top method directly on Dataset class.
>
> Ordering whole dataset only to take first 10 or so top records is not
> really an acceptable option for us. Comparison function can be expansive
> and the size of dataset is (unsurprisingly) big.
>
> To be honest I do not really understand what do you mean by b). Since
> DataFrame is now only an alias for Dataset[Row] what do you mean by
> "DataFrame-like counterpart"?
>
> Thanks
>
> On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen  wrote:
>
>> You can always call .rdd.top(n) of course. Although it's slightly
>> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
>> easier way.
>>
>> I don't think if there's a strong reason other than it wasn't worth it
>> to write this and many other utility wrappers that a) already exist on
>> the underlying RDD API if you want them, and b) have a DataFrame-like
>> counterpart already that doesn't really need wrapping in a different
>> API.
>>
>> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
>>  wrote:
>> > Hey all,
>> >
>> > in RDD api there is very usefull method called top. It finds top n
>> records
>> > in according to certain ordering without sorting all records. Very
>> usefull!
>> >
>> > There is no top method nor similar functionality in Dataset api. Has
>> anybody
>> > any clue why? Is there any specific reason for this?
>> >
>> > Any thoughts?
>> >
>> > thanks
>> >
>> > Jakub D.
>>
>
>


Re: Why there is no top method in dataset api

2016-09-05 Thread Jakub Dubovsky
Thanks Sean,

I was under the impression that the Spark creators are trying to persuade the user
community not to use the RDD API directly. The Spark Summit I attended was full of
this. So I am a bit surprised to hear use-the-RDD-API as advice from
you. But if this is the way, then I have a second question. For conversion
from a Dataset to an RDD I would use the Dataset.rdd lazy val. Since it is a lazy
val, it suggests there is some computation going on to create the RDD as a copy.
The question is how computationally expensive is this conversion? If
there is a significant overhead, then it is clear why one would want to have
a top method directly on the Dataset class.

Ordering the whole dataset only to take the first 10 or so top records is not
really an acceptable option for us. The comparison function can be expensive
and the size of the dataset is (unsurprisingly) big.

To be honest I do not really understand what do you mean by b). Since
DataFrame is now only an alias for Dataset[Row] what do you mean by
"DataFrame-like counterpart"?

Thanks

On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen  wrote:

> You can always call .rdd.top(n) of course. Although it's slightly
> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
> easier way.
>
> I don't think if there's a strong reason other than it wasn't worth it
> to write this and many other utility wrappers that a) already exist on
> the underlying RDD API if you want them, and b) have a DataFrame-like
> counterpart already that doesn't really need wrapping in a different
> API.
>
> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
>  wrote:
> > Hey all,
> >
> > in RDD api there is very usefull method called top. It finds top n
> records
> > in according to certain ordering without sorting all records. Very
> usefull!
> >
> > There is no top method nor similar functionality in Dataset api. Has
> anybody
> > any clue why? Is there any specific reason for this?
> >
> > Any thoughts?
> >
> > thanks
> >
> > Jakub D.
>


Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-05 Thread Сергей Романов

Hi, Gavin,

Shuffling is exactly the same in both requests and is minimal. Both requests
produce one shuffle task. Running time is the only difference I can see in
the metrics:

timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
schema=schema).groupBy().sum(*(['dd_convs'] * 57) ).collect, number=1)
0.713730096817
 {
    "id" : 368,
    "name" : "duration total (min, med, max)",
    "value" : "524"
  }, {
    "id" : 375,
    "name" : "internal.metrics.executorRunTime",
    "value" : "527"
  }, {
    "id" : 391,
    "name" : "internal.metrics.shuffle.write.writeTime",
    "value" : "244495"
  }

timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
schema=schema).groupBy().sum(*(['dd_convs'] * 58) ).collect, number=1)
2.97951102257

  }, {
    "id" : 469,
    "name" : "duration total (min, med, max)",
    "value" : "2654"
  }, {
    "id" : 476,
    "name" : "internal.metrics.executorRunTime",
    "value" : "2661"
  }, {
    "id" : 492,
    "name" : "internal.metrics.shuffle.write.writeTime",
    "value" : "371883"
  }, {
Full metrics in attachment.
>Saturday, 3 September 2016, 19:53 +03:00 from Gavin Yue :
>
>Any shuffling? 
>
>
>On Sep 3, 2016, at 5:50 AM, Сергей Романов < romano...@inbox.ru.INVALID > 
>wrote:
>
>>Same problem happens with CSV data file, so it's not parquet-related either.
>>
>>Welcome to
>>    __
>> / __/__  ___ _/ /__
>>    _\ \/ _ \/ _ `/ __/  '_/
>>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>  /_/
>>
>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>SparkSession available as 'spark'.
> import timeit
> from pyspark.sql.types import *
> schema = StructType([StructField('dd_convs', FloatType(), True)])
> for x in range(50, 70): print x, 
> timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
> schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>>50 0.372850894928
>>51 0.376906871796
>>52 0.381325960159
>>53 0.385444164276
>>54 0.38685192
>>55 0.388918161392
>>56 0.397624969482
>>57 0.391713142395
>>58 2.62714004517
>>59 2.68421196938
>>60 2.74627685547
>>61 2.81081581116
>>62 3.43532109261
>>63 3.07742786407
>>64 3.03904604912
>>65 3.01616096497
>>66 3.06293702126
>>67 3.09386610985
>>68 3.27610206604
>>69 3.2041969299
>>
>>Saturday, 3 September 2016, 15:40 +03:00 from Сергей Романов <
>>romano...@inbox.ru.INVALID >:
>>>
>>>Hi,
>>>I had narrowed down my problem to a very simple case. I'm sending 27kb 
>>>parquet in attachment. (file:///data/dump/test2 in example)
>>>Please, can you take a look at it? Why there is performance drop after 57 
>>>sum columns?
>>>Welcome to
>>>    __
>>> / __/__  ___ _/ /__
>>>    _\ \/ _ \/ _ `/ __/  '_/
>>>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>>  /_/
>>>
>>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>>SparkSession available as 'spark'.
>> import timeit
>> for x in range(70): print x, 
>> timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs']
>>  * x) ).collect, number=1)
>>>... 
>>>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
>>>details.
>>>0 1.05591607094
>>>1 0.200426101685
>>>2 0.203800916672
>>>3 0.176458120346
>>>4 0.184863805771
>>>5 0.232321023941
>>>6 0.216032981873
>>>7 0.201778173447
>>>8 0.292424917221
>>>9 0.228524923325
>>>10 0.190534114838
>>>11 0.197028160095
>>>12 0.270443916321
>>>13 0.429781913757
>>>14 0.270851135254
>>>15 0.776989936829
>>>16 0.27879181
>>>17 0.227638959885
>>>18 0.212944030762
>>>19 0.2144780159
>>>20 0.22200012207
>>>21 0.262261152267
>>>22 0.254227876663
>>>23 0.275084018707
>>>24 0.292124032974
>>>25 0.280488014221
>>>16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan 
>>>since it was too large. This behavior can be adjusted by setting 
>>>'spark.debug.maxToStringFields' in SparkEnv.conf.
>>>26 0.290093898773
>>>27 0.238478899002
>>>28 0.246420860291
>>>29 0.241401195526
>>>30 0.255286931992
>>>31 0.42702794075
>>>32 0.327946186066
>>>33 0.434395074844
>>>34 0.314198970795
>>>35 0.34576010704
>>>36 0.278323888779
>>>37 0.289474964142
>>>38 0.290827989578
>>>39 0.376291036606
>>>40 0.347742080688
>>>41 0.363158941269
>>>42 0.318687915802
>>>43 0.376327991486
>>>44 0.374994039536
>>>45 0.362971067429
>>>46 0.425967931747
>>>47 0.370860099792
>>>48 0.443903923035
>>>49 0.374128103256
>>>50 0.378985881805
>>>51 0.476850986481
>>>52 0.451028823853
>>>53 0.432540893555
>>>54 0.514838933945
>>>55 0.53990483284
>>>56 0.449142932892
>>>57 0.465240001678 // 5x slower after 57 columns
>>>58 2.40412116051
>>>59 2.41632795334
>>>60 2.41812801361
>>>61 2.55726218224
>>>62 2.55484509468
>>>63 2.56128406525
>>>64 2.54642391205
>>>65 2.56381797791
>>>66 2.56871509552
>>>67 2.66187620163
>>>68 2.63496208191
>>>69 

Re: [SparkSQL+SparkStreaming]SparkStreaming APP can not load data into SparkSQL table

2016-09-05 Thread luohui20001

The data can be written as Parquet into HDFS, but the data-loading process is
not working as expected.



 

ThanksBest regards!
San.Luo

- Original message -
From:
To: "user" 
Subject: [SparkSQL+SparkStreaming]SparkStreaming APP can not load data into SparkSQL
table
Date: 2016-09-05 18:55

Hi guys: I have a question: my Spark Streaming app cannot load data
into a Spark SQL table. Here is my code:
val conf = new SparkConf().setAppName("KafkaStreaming for " + 
topics).setMaster("spark://master60:7077")
val storageLevel = StorageLevel.DISK_ONLY
val ssc = new StreamingContext(conf, Seconds(batchInterval.toInt))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
//Receiver-based 
val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, 
storageLevel)

kafkaStream.foreachRDD { rdd =>
  val x = rdd.count()
  println(s"processing $x records=")
  rdd.collect().foreach(println)
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val logRDD = 
sqlContext.read.json(rdd.values).select("payload").map(_.mkString)
  val logRDD2 = logRDD.map(_.split(',')).map { x =>
NginxLog(x(0).trim().toFloat.toInt,
  x(1).trim(),
  x(2).trim(),
  x(3).trim(),
  x(4).trim(),
  x(5).trim(),
  x(6).trim(),
  x(7).trim(),
  x(8).trim(),
  x(9).trim(),
  x(10).trim())
  }
  val recDF = logRDD2.toDF
  recDF.printSchema()

  val hc = new org.apache.spark.sql.hive.HiveContext(rdd.sparkContext)
  val index = rdd.id
  recDF.write.parquet(s"/etl/tables/nginxlog/${topicNO}/${index}")
  hc.sql("CREATE TABLE IF NOT EXISTS nginxlog(msec Int,remote_addr 
String,u_domain String,u_url String,u_title String,u_referrer String,u_sh 
String,u_sw String,u_cd String,u_lang String,u_utrace String) STORED AS 
PARQUET")  hc.sql(s"LOAD DATA INPATH 
'/etl/tables/nginxlog/${topicNO}/${index}' INTO TABLE nginxlog")}

There isn't any exception while running my app. However, apart from the data in
the first batch, which could be loaded into table nginxlog, no other batches can
be successfully loaded. I cannot understand the reason for this behavior. Is it an
issue with my HiveContext (hc)?
PS.my spark cluster version: 1.6.1






 

ThanksBest regards!
San.Luo


[SparkSQL+SparkStreaming]SparkStreaming APP can not load data into SparkSQL table

2016-09-05 Thread luohui20001
Hi guys: I have a question: my Spark Streaming app cannot load data
into a Spark SQL table. Here is my code:
val conf = new SparkConf().setAppName("KafkaStreaming for " + 
topics).setMaster("spark://master60:7077")
val storageLevel = StorageLevel.DISK_ONLY
val ssc = new StreamingContext(conf, Seconds(batchInterval.toInt))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
//Receiver-based 
val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, 
storageLevel)

kafkaStream.foreachRDD { rdd =>
  val x = rdd.count()
  println(s"processing $x records=")
  rdd.collect().foreach(println)
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val logRDD = 
sqlContext.read.json(rdd.values).select("payload").map(_.mkString)
  val logRDD2 = logRDD.map(_.split(',')).map { x =>
NginxLog(x(0).trim().toFloat.toInt,
  x(1).trim(),
  x(2).trim(),
  x(3).trim(),
  x(4).trim(),
  x(5).trim(),
  x(6).trim(),
  x(7).trim(),
  x(8).trim(),
  x(9).trim(),
  x(10).trim())
  }
  val recDF = logRDD2.toDF
  recDF.printSchema()

  val hc = new org.apache.spark.sql.hive.HiveContext(rdd.sparkContext)
  val index = rdd.id
  recDF.write.parquet(s"/etl/tables/nginxlog/${topicNO}/${index}")
  hc.sql("CREATE TABLE IF NOT EXISTS nginxlog(msec Int,remote_addr 
String,u_domain String,u_url String,u_title String,u_referrer String,u_sh 
String,u_sw String,u_cd String,u_lang String,u_utrace String) STORED AS 
PARQUET")  hc.sql(s"LOAD DATA INPATH 
'/etl/tables/nginxlog/${topicNO}/${index}' INTO TABLE nginxlog")}

There isn't any exception while running my app. However, apart from the data in
the first batch, which could be loaded into table nginxlog, no other batches can
be successfully loaded. I cannot understand the reason for this behavior. Is it an
issue with my HiveContext (hc)?
PS.my spark cluster version: 1.6.1






 

ThanksBest regards!
San.Luo


Re: Unable to get raw probabilities after clearing model threshold

2016-09-05 Thread kundan kumar
Sorry, my bad.

The issue got resolved.

Thanks,
Kundan

On Mon, Sep 5, 2016 at 3:58 PM, kundan kumar  wrote:

> Hi,
>
> I am unable to get the raw probabilities despite of clearing the
> threshold. Its still printing the predicted label.
>
> Can someone help resolve this issue.
>
> Here is the code snippet.
>
> LogisticRegressionWithSGD lrLearner = new LogisticRegressionWithSGD();
> LogisticRegressionModel model = lrLearner.run(labeledPointTrain.rdd());
> model.clearThreshold();
> JavaRDD> predictionAndLabels =
> labeledPointTrain.map(
> new Function>() {
> public Tuple2 call(LabeledPoint p) {
> Double prediction = model.predict(p.features());
> return new Tuple2(prediction, p.label());
> }
> }
> );
>
>
> predictionAndLabels.foreach(new VoidFunction>(){
>
> @Override
> public void call(Tuple2 pred) throws Exception {
> logger.error("PREDICTION:" + pred._1() + " ACTUAL LABEL:" + pred._2());
>
> }
> });
>
>
>
> Thanks,
> Kundan
>


Unable to get raw probabilities after clearing model threshold

2016-09-05 Thread kundan kumar
Hi,

I am unable to get the raw probabilities despite clearing the threshold.
It is still printing the predicted label.

Can someone help resolve this issue.

Here is the code snippet.

LogisticRegressionWithSGD lrLearner = new LogisticRegressionWithSGD();
LogisticRegressionModel model = lrLearner.run(labeledPointTrain.rdd());
model.clearThreshold();
JavaRDD<Tuple2<Double, Double>> predictionAndLabels = labeledPointTrain.map(
new Function<LabeledPoint, Tuple2<Double, Double>>() {
public Tuple2<Double, Double> call(LabeledPoint p) {
Double prediction = model.predict(p.features());
return new Tuple2<Double, Double>(prediction, p.label());
}
}
);


predictionAndLabels.foreach(new VoidFunction<Tuple2<Double, Double>>(){

@Override
public void call(Tuple2<Double, Double> pred) throws Exception {
logger.error("PREDICTION:" + pred._1() + " ACTUAL LABEL:" + pred._2());

}
});



Thanks,
Kundan


Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-09-05 Thread Mich Talebzadeh
Hi Sivakumaran

Thanks for your very useful research. Apologies have been very busy. Let me
read through and come back.

Regards


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 September 2016 at 10:21, Sivakumaran S  wrote:

> Hi Mich,
>
> Here is a summary of what I have so far understood (May require
> correction):
>
> *Facts*.
> 1. Spark has an internal web (http) server based on Jetty (
> http://www.eclipse.org/jetty/) that serves a ‘read-only’ UI on port 4040.
> Additional SCs are displayed in sequentially increased port numbers (4041,
> 4042 and so on)
> 2. Spark has a ReST API for job submissions.
> 3. The current UI has seven tabs: Jobs, Stages, Storage, Environment,
> Executors, Streaming and SQL.
> 4. Some tabs have further links pointing to greater depth of information
> about the job.
> 5. Changing Spark source code is not an option for building a dashboard
> because most users use the compiled binaries or custom solutions provided
> by cloud providers.
> 6. SparkListener can be used to monitor only from within the job
> submitted and is therefore job-specific.
>
> *Assumptions*
> 1. There is no API (ReST or otherwise) for retrieving statistics or
> information about current jobs.
> 2. There is no way of automatically refreshing the information in these
> tabs.
>
> *Proposed Solution to build a dashboard*
> 1. HTTP is stateless. Every call to the web server gives the current
> snapshot of information about the Spark job.
> 2. Python can be used to write this software.
> 3. So this proposed software sits between the Spark Server and the
> browser. It repetitively (say, every 5 seconds) gets data (the HTML pages)
>  from port 4040 and after parsing the required information from these HTML
> pages, updates the dashboard using Websocket. A light weight web server
> framework will also be required for hosting the dashboard.
> 4. Identifying what information is most important to be displayed (Some
> experienced Spark users can provide inputs) is important.
>
>
>
> If the folks at Spark provide a ReST API for retrieving important
> information about running jobs, this would be far more simpler.
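For what it's worth, the Spark UI already serves a JSON monitoring endpoint under /api/v1 on the same port (see the Spark monitoring documentation); a minimal polling sketch, with host, port and interval as placeholder values:

import scala.io.Source

val endpoint = "http://localhost:4040/api/v1/applications"

// Poll every 5 seconds and print the raw JSON; a real dashboard would parse it
// and push only the fields it cares about over a websocket.
while (true) {
  println(Source.fromURL(endpoint).mkString)
  Thread.sleep(5000)
}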
>
> Regards,
>
> Sivakumaran S
>
>
>
> On 27-Aug-2016, at 3:59 PM, Mich Talebzadeh 
> wrote:
>
> Thanks Nguyen for the link.
>
> I installed Super Refresh as ADD on to Chrome. By default the refresh is
> stop until you set it to x seconds. However, the issue we have is that
> Spark UI comes with 6+ tabs and you have to repeat the process for each tab.
>
> However, that messes up the things. For example if I choose to refresh
> "Executors" tab every 2 seconds and decide to refresh "Stages" tab, then
> there is a race condition whereas you are thrown back to the last refresh
> page which is not really what one wants.
>
> Ideally one wants the Spark UI page identified by host:port to be the
> driver and every other tab underneath say host:port/Stages to be refreshed
> once we open that tab and stay there. If I go back to say SQL tab, I like
> to see that refreshed ever n seconds.
>
> I hope this makes sense.
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 August 2016 at 15:01, nguyen duc Tuan  wrote:
>
>> The simplest solution that I found: using an browser extension which do
>> that for you :D. For example, if you are using Chrome, you can use this
>> extension: https://chrome.google.com/webstore/detail/easy-auto-refresh/
>> aabcgdmkeabbnleenpncegpcngjpnjkc/related?hl=en
>> An other way, but a bit more manually using javascript: start with a
>> window, you will create a child window with your target url. The parent
>> window will refresh that child window for you. Due to same-original
>> pollicy, you should set parent url to the same url as your target url. Try
>> this in your web console:
>> wi = window.open("your target url")
>> var timeInMinis = 2000
>> 

Re: How to detect when a JavaSparkContext gets stopped

2016-09-05 Thread Sean Owen
You can look into the SparkListener interface to get some of those
messages. Losing the master though is pretty fatal to all apps.
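For example, a minimal sketch of such a listener (registered on the underlying SparkContext, which JavaSparkContext exposes via sc()):

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    // React here, e.g. flag the context as dead so the driver stops submitting jobs to it.
    println(s"Spark application ended at ${applicationEnd.time}")
  }
})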

On Mon, Sep 5, 2016 at 7:30 AM, Hough, Stephen C  wrote:
> I have a long running application, configured to be HA, whereby only the
> designated leader will acquire a JavaSparkContext, listen for requests and
> push jobs onto this context.
>
>
>
> The problem I have is, whenever my AWS instances running workers die (either
> a time to live expires or I cancel those instances) it seems that Spark
> blames my driver, I see the following in logs.
>
>
>
> org.apache.spark.SparkException: Exiting due to error from cluster
> scheduler: Master removed our application: FAILED
>
>
>
> However my application doesn’t get a notification so thinks everything is
> okay, until it receives another request and tries to submit to the spark and
> gets a
>
>
>
> java.lang.IllegalStateException: Cannot call methods on a stopped
> SparkContext.
>
>
>
> Is there a way I can observe when the JavaSparkContext I own is stopped?
>
>
>
> Thanks
> Stephen
>
>
> This email and any attachments are confidential and may also be privileged.
> If you are not the intended recipient, please delete all copies and notify
> the sender immediately. You may wish to refer to the incorporation details
> of Standard Chartered PLC, Standard Chartered Bank and their subsidiaries at
> https://www.sc.com/en/incorporation-details.html

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to detect when a JavaSparkContext gets stopped

2016-09-05 Thread Hough, Stephen C
I have a long running application, configured to be HA, whereby only the 
designated leader will acquire a JavaSparkContext, listen for requests and push 
jobs onto this context.

The problem I have is, whenever my AWS instances running workers die (either a 
time to live expires or I cancel those instances) it seems that Spark blames my 
driver, I see the following in logs.

org.apache.spark.SparkException: Exiting due to error from cluster scheduler: 
Master removed our application: FAILED

However, my application doesn't get a notification, so it thinks everything is okay
until it receives another request, tries to submit to Spark, and gets a

java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

Is there a way I can observe when the JavaSparkContext I own is stopped?

Thanks
Stephen

This email and any attachments are confidential and may also be privileged. If 
you are not the intended recipient, please delete all copies and notify the 
sender immediately. You may wish to refer to the incorporation details of 
Standard Chartered PLC, Standard Chartered Bank and their subsidiaries at 
https://www.sc.com/en/incorporation-details.html