Re: is there a way to persist the lineages generated by spark?

2017-04-03 Thread ayan guha
How about storing the logical plans (or the toDebugString output, in the case
of an RDD) in an external file on the driver?
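
For example, a rough sketch (assuming rdd and df stand for the computations
behind the dashboard; the plan strings are meant for humans, not a stable API):

import java.nio.file.{Files, Paths}

// RDD lineage as a human-readable string
Files.write(Paths.get("/tmp/rdd-lineage.txt"),
  rdd.toDebugString.getBytes("UTF-8"))

// DataFrame/Dataset: parsed, analyzed and optimized logical plans plus the
// physical plan
Files.write(Paths.get("/tmp/df-plans.txt"),
  df.queryExecution.toString.getBytes("UTF-8"))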

On Tue, Apr 4, 2017 at 1:19 PM, kant kodali  wrote:

> Hi All,
>
> I am wondering if there is a way to persist the lineage that Spark generates
> underneath. Some of our clients want us to prove that the results of the
> computations we show on a dashboard are correct. If we could show the
> lineage of transformations that were executed to get to a result, that would
> be the Q.E.D. moment, but I am not even sure this is possible with Spark.
>
> Thanks,
> kant
>



-- 
Best Regards,
Ayan Guha


is there a way to persist the lineages generated by spark?

2017-04-03 Thread kant kodali
Hi All,

I am wondering if there is a way to persist the lineage that Spark generates
underneath. Some of our clients want us to prove that the results of the
computations we show on a dashboard are correct. If we could show the lineage
of transformations that were executed to get to a result, that would be the
Q.E.D. moment, but I am not even sure this is possible with Spark.

Thanks,
kant


map transform on array in spark sql

2017-04-03 Thread Koert Kuipers
I have a DataFrame where one column has the type:

ArrayType(StructType(Seq(
  StructField("a", typeA, nullableA),
  StructField("b", typeB, nullableB)
)))

I would like to map over this array to pick the first field of each struct,
so the result should be ArrayType(typeA, nullableA). I realize I can do this
with a Scala UDF if I know typeA, but what if I don't know typeA?

Basically, I would like to write an expression like:
map(col("x"), _(0))
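
For what it's worth, here is a rough, untested sketch of the UDF route for
the unknown-typeA case: look up the element type in df.schema at runtime and
build an untyped UDF with an explicit return type (assuming the column is
called "x" and the first struct field is "a"):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, StructType}

// derive typeA from the schema instead of hard-coding it
val structType = df.schema("x").dataType
  .asInstanceOf[ArrayType].elementType.asInstanceOf[StructType]
val typeA = structType("a").dataType

// untyped UDF: take field 0 of every struct in the array
val firstField = udf(
  (xs: Seq[Row]) =>
    if (xs == null) null else xs.map(r => if (r == null) null else r.get(0)),
  ArrayType(typeA))

val result = df.withColumn("x_a", firstField(col("x")))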

Any suggestions?


Do we support excluding the CURRENT ROW in PARTITION BY windowing functions?

2017-04-03 Thread mathewwicks
Here is an example to illustrate my question. 

In this toy example, we are collecting a list of the other products that
each user has bought and appending it as a new column. (Also note that we
are filtering on some arbitrary column, 'good_bad'.)

I would like to know if we support NOT including the CURRENT ROW in the
OVER(PARTITION BY xxx) windowing function. 

For example, transaction 1 would have `other_purchases = [prod2, prod3]`
rather than `other_purchases = [prod1, prod2, prod3]`.

*--- Code Below ---*
df = spark.createDataFrame([
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (5, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()

df = df.selectExpr(
    "trans_id",
    "user_id",
    "COLLECT_LIST(CASE WHEN good_bad == 'good' THEN prod_id END) "
    "OVER(PARTITION BY user_id) AS other_purchases"
)
df.show()

Here is a stackoverflow link: 
https://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions

  







Re: Alternatives for dataframe collectAsList()

2017-04-03 Thread Paul Tremblay
What do you want to do with the results of the query?
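
If the goal is just to iterate over the rows without holding them all in
driver memory at once, one option (a sketch against the plain Dataset API,
not tuned for your workload) is toLocalIterator(), which fetches one
partition at a time:

import org.apache.spark.sql.{Dataset, Row}

val df: Dataset[Row] = sqlContext.sql("query")

// streams one partition at a time to the driver instead of the whole result
val it = df.toLocalIterator()
while (it.hasNext) {
  val row = it.next()
  // process the row here
}

If the results only need to end up somewhere else (files, a database),
writing them out with df.write from the executors avoids the driver round
trip entirely.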

Henry

On Wed, Mar 29, 2017 at 12:00 PM, szep.laszlo.it 
wrote:

> Hi,
>
> after I created a dataset
>
> Dataset<Row> df = sqlContext.sql("query");
>
> I need to get the result values, so I call the method collectAsList():
>
> List<Row> list = df.collectAsList();
>
> But it's very slow when I work with large datasets (20-30 million records).
> I know the result isn't already present in the driver app, and that it takes
> a long time because collectAsList() collects all the data from the worker
> nodes.
>
> But then what is the right way to get the result values? Is there another
> way to iterate over the rows of a result dataset, or to get the values? Can
> anyone post a small, working example?
>
> Thanks & Regards,
> Laszlo Szep
>
>
>
>


-- 
Paul Henry Tremblay
Robert Half Technology


Re: Read file and represent rows as Vectors

2017-04-03 Thread Paul Tremblay
So if I am understanding your problem, you have the data in CSV files, but
the CSV files are gzipped? If so, Spark can read gzip files directly. Sorry
if I didn't understand your question.
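
For example (a minimal sketch; compressed text files are decompressed
transparently based on the file extension):

val lines = spark.sparkContext.textFile("/path/to/docword.txt.gz")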

Henry

On Mon, Apr 3, 2017 at 5:05 AM, Old-School 
wrote:

> I have a dataset that contains DocID, WordID and frequency (count) as shown
> below. Note that the first three numbers represent 1. the number of
> documents, 2. the number of words in the vocabulary and 3. the total number
> of words in the collection.
>
> 189
> 1430
> 12300
> 1 2 1
> 1 39 1
> 1 42 3
> 1 77 1
> 1 95 1
> 1 96 1
> 2 105 1
> 2 108 1
> 3 133 3
>
>
> What I want to do is to read the data (ignore the first three lines),
> combine the words per document and finally represent each document as a
> vector that contains the frequency of the wordID.
>
> Based on the above dataset the representation of documents 1, 2 and 3 will
> be (note that vocab_size can be extracted by the second line of the data):
>
> val data = Array(
> Vectors.sparse(vocab_size, Seq((2, 1.0), (39, 1.0), (42, 3.0), (77, 1.0),
> (95, 1.0), (96, 1.0))),
> Vectors.sparse(vocab_size, Seq((105, 1.0), (108, 1.0))),
> Vectors.sparse(vocab_size, Seq((133, 3.0))))
>
>
> The problem is that I am not quite sure how to read the .txt.gz file as an
> RDD and create an Array of sparse vectors as described above. Please note
> that I actually want to pass the data array to the PCA transformer.
>
>
>
>
>


-- 
Paul Henry Tremblay
Robert Half Technology


_SUCCESS file validation on read

2017-04-03 Thread drewrobb
When writing a dataframe, a _SUCCESS file is created to mark that the entire
dataframe has been written. However, the existence of this _SUCCESS file does
not seem to be validated by default on reads. In some cases this would allow
partially written dataframes to be read back. Is this behavior configurable?
Is the lack of validation intentional?

Thanks!

Here is an example from spark 2.1.0 shell. I would expect the read step to
fail because I've manually removed the _SUCCESS file:

scala> spark.range(10).write.save("/tmp/test")

$ rm /tmp/test/_SUCCESS

scala> spark.read.parquet("/tmp/test").show()
+---+
| id|
+---+
|  8|
|  9|
|  3|
|  4|
|  5|
|  0|
|  6|
|  7|
|  2|
|  1|
+---+
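
As far as I know there is no built-in read-side check, but one workaround (a
sketch, assuming the default Hadoop filesystem) is to look for the marker
yourself before reading:

import org.apache.hadoop.fs.{FileSystem, Path}

val dataPath = "/tmp/test"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// refuse to read a directory whose write never completed
require(fs.exists(new Path(dataPath, "_SUCCESS")),
  s"$dataPath has no _SUCCESS marker; the write may be incomplete")

val df = spark.read.parquet(dataPath)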






Re: Convert Dataframe to Dataset in pyspark

2017-04-03 Thread Michael Armbrust
You don't need encoders in Python since it's all dynamically typed anyway.
You can just do the following if you want the data as strings.

sqlContext.read.text("/home/spark/1.6/lines").rdd.map(lambda row: row.value)

2017-04-01 5:36 GMT-07:00 Selvam Raman :

> In Scala,
> val ds = sqlContext.read.text("/home/spark/1.6/lines").as[String]
>
> what is the equivalent code in pyspark?
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Pyspark - pickle.PicklingError: Can't pickle

2017-04-03 Thread Selvam Raman
I ran the code below in standalone mode, with Python 2.7.6, spaCy 1.7+ and
Spark 2.0.1.

I'm new to PySpark; please help me understand the two versions of the code
below.

Why does the first version run fine, whereas the second throws
pickle.PicklingError: Can't pickle <function <lambda> at 0x107e39110>?

(I suspect the second approach fails because the object could not be
serialized and sent to the workers.)

*1) Run-Success:*

*(SpacyExample-Module)*

import spacy

nlp = spacy.load('en_default')

def spacyChunks(content):
    doc = nlp(content)
    mp = []
    for chunk in doc.noun_chunks:
        phrase = content[chunk.start_char: chunk.end_char]
        mp.append(phrase)
    # print(mp)
    return mp

if __name__ == '__main__':
    pass


*Main-Module:*

spark = SparkSession.builder.appName("readgzip") \
    .config(conf=conf).getOrCreate()

gzfile = spark.read.schema(schema).json("")

...

textresult.rdd.map(lambda x: x[0]) \
    .flatMap(lambda data: SpacyExample.spacyChunks(data)) \
    .saveAsTextFile("")




*2) Run-Failure:*

*MainModule:*

nlp = spacy.load('en_default')

def spacyChunks(content):
    doc = nlp(content)
    mp = []
    for chunk in doc.noun_chunks:
        phrase = content[chunk.start_char: chunk.end_char]
        mp.append(phrase)
    # print(mp)
    return mp

if __name__ == '__main__':
    # create the SparkSession, read the file, then:
    file.rdd.map(...).flatMap(lambda data: spacyChunks(data)).saveAsTextFile(...)


Stack Trace:

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 286, in save

f(self, obj) # Call unbound method with explicit self

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 649, in save_dict

self._batch_setitems(obj.iteritems())

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 681, in _batch_setitems

save(v)

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 331, in save

self.save_reduce(obj=obj, *rv)

  File
"/Users/rs/Downloads/spark-2.0.1-bin-hadoop2.7/python/pyspark/cloudpickle.py",
line 535, in save_reduce

save(args)

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 286, in save

f(self, obj) # Call unbound method with explicit self

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 562, in save_tuple

save(element)

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 286, in save

f(self, obj) # Call unbound method with explicit self

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 649, in save_dict

self._batch_setitems(obj.iteritems())

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 681, in _batch_setitems

save(v)

  File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
line 317, in save

self.save_global(obj, rv)

  File
"/Users/rs/Downloads/spark-2.0.1-bin-hadoop2.7/python/pyspark/cloudpickle.py",
line 390, in save_global

raise pickle.PicklingError("Can't pickle %r" % obj)

pickle.PicklingError: Can't pickle <function <lambda> at 0x107e39110>

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Executor unable to pick postgres driver in Spark standalone cluster

2017-04-03 Thread Rishikesh Teke

Hi all,

I was submitting a Play application to a Spark 2.1 standalone cluster. The
Postgres dependency is added to the Play application, and the application
works with local Spark libraries. But at run time on the standalone cluster
it gives me this error:

o.a.s.s.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 1, 172.31.21.3,
executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver

I have placed the following in spark-defaults.conf:

spark.executor.extraClassPath  
/home/ubuntu/downloads/postgres/postgresql-9.4-1200-jdbc41.jar
spark.driver.extraClassPath
/home/ubuntu/downloads/postgres/postgresql-9.4-1200-jdbc41.jar

The executors are still unable to pick up the driver.
Am I missing something? Any help is appreciated.
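
One thing I am not sure about (just an assumption, not a confirmed
diagnosis): spark.executor.extraClassPath only works if the jar already
exists at that path on every worker machine. If it is only present on the
driver host, letting Spark ship it, e.g. via spark.jars in
spark-defaults.conf (or --jars on spark-submit), might be needed instead:

spark.jars /home/ubuntu/downloads/postgres/postgresql-9.4-1200-jdbc41.jar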
Thanks.








Read file and represent rows as Vectors

2017-04-03 Thread Old-School
I have a dataset that contains DocID, WordID and frequency (count) as shown
below. Note that the first three numbers represent 1. the number of
documents, 2. the number of words in the vocabulary and 3. the total number
of words in the collection.

189
1430
12300
1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
1 96 1
2 105 1
2 108 1
3 133 3


What I want to do is to read the data (ignore the first three lines),
combine the words per document and finally represent each document as a
vector that contains the frequency of the wordID.

Based on the above dataset the representation of documents 1, 2 and 3 will
be (note that vocab_size can be extracted by the second line of the data):

val data = Array(
Vectors.sparse(vocab_size, Seq((2, 1.0), (39, 1.0), (42, 3.0), (77, 1.0),
(95, 1.0), (96, 1.0))),
Vectors.sparse(vocab_size, Seq((105, 1.0), (108, 1.0))),
Vectors.sparse(vocab_size, Seq((133, 3.0))))


The problem is that I am not quite sure how to read the .txt.gz file as an
RDD and create an Array of sparse vectors as described above. Please note
that I actually want to pass the data array to the PCA transformer.
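
A rough sketch of one possible approach (assuming whitespace-separated
"docId wordId count" lines after the three header lines, and a hypothetical
path /path/to/docword.txt.gz):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val raw = spark.sparkContext.textFile("/path/to/docword.txt.gz")

// the second header line is the vocabulary size
val vocabSize = raw.take(3)(1).trim.toInt

val docVectors = raw
  .zipWithIndex()
  .filter { case (_, idx) => idx >= 3 }   // skip the three header lines
  .map { case (line, _) =>
    val Array(doc, word, count) = line.trim.split("\\s+")
    (doc.toInt, (word.toInt, count.toDouble))
  }
  .groupByKey()                           // combine the words per document
  .map { case (_, wordCounts) => Vectors.sparse(vocabSize, wordCounts.toSeq) }

// mllib's PCA accepts the RDD directly; collect() only if an Array is needed
val data: Array[Vector] = docVectors.collect()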






Re: Benchmarking streaming frameworks

2017-04-03 Thread Alonso Isidoro Roman
I remember that Yahoo did something similar...


https://github.com/yahoo/streaming-benchmarks


Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman


2017-04-03 9:41 GMT+02:00 gvdongen :

> Dear users of Streaming Technologies,
>
> As a PhD student in big data analytics, I am currently in the process of
> compiling a list of benchmarks (to test multiple streaming frameworks) in
> order to create an expanded benchmarking suite. The benchmark suite is
> being
> developed as a part of my current work at Ghent University.
>
> The included frameworks at this time are, in no particular order, Spark,
> Flink, Kafka (Streams), Storm (Trident) and Drizzle. Any pointers to
> previous work or relevant benchmarks would be appreciated.
>
> Best regards,
> Giselle van Dongen
>
>


Re: Do we support excluding the current row in PARTITION BY windowing functions?

2017-04-03 Thread mathewwicks
Here is a stackoverflow link:

https://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions

  






Re: Do we support excluding the current row in PARTITION BY windowing functions?

2017-04-03 Thread mathewwicks
I am not sure why, but the mailing list is saying, "This post has NOT been
accepted by the mailing list yet".

On Mon, 3 Apr 2017 at 20:52 mathewwicks [via Apache Spark User List] <
ml-node+s1001560n28558...@n3.nabble.com> wrote:

> Here is an example to illustrate my point.
>
> In this toy example, we are collecting a list of the other products that
> each user has bought and appending it as a new column. (Also note that we
> are filtering on some arbitrary column, 'good_bad'.)
>
> I would like to know if we support NOT including the CURRENT ROW in the
> PARTITION BY.
> (E.g. transaction 1 would have `other_purchases = [prod2, prod3]` rather
> than `other_purchases = [prod1, prod2, prod3]`)
>
> --- Code Below ---
>
> df = spark.createDataFrame([
>     (1, "user1", "prod1", "good"),
>     (2, "user1", "prod2", "good"),
>     (3, "user1", "prod3", "good"),
>     (4, "user2", "prod3", "bad"),
>     (5, "user2", "prod4", "good"),
>     (5, "user2", "prod5", "good")],
>     ("trans_id", "user_id", "prod_id", "good_bad")
> )
> df.show()
>
> df = df.selectExpr(
>     "trans_id",
>     "user_id",
>     "COLLECT_LIST(CASE WHEN good_bad == 'good' THEN prod_id END) "
>     "OVER(PARTITION BY user_id) AS other_purchases"
> )
> df.show()
> 
>
>





Do we support excluding the current row in PARTITION BY windowing functions?

2017-04-03 Thread mathewwicks
Here is an example to illustrate my point.

In this toy example, we are collecting a list of the other products that
each user has bought and appending it as a new column. (Also note that we
are filtering on some arbitrary column, 'good_bad'.)

I would like to know if we support NOT including the CURRENT ROW in the
PARTITION BY. 
(E.g. transaction 1 would have `other_purchases = [prod2, prod3]` rather
than `other_purchases = [prod1, prod2, prod3]`)

--- Code Below ---

df = spark.createDataFrame([
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (5, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()

df = df.selectExpr(
    "trans_id",
    "user_id",
    "COLLECT_LIST(CASE WHEN good_bad == 'good' THEN prod_id END) "
    "OVER(PARTITION BY user_id) AS other_purchases"
)
df.show()







Re: Does Apache Spark use any Dependency Injection framework?

2017-04-03 Thread Jacek Laskowski
Hi,

Answering your question from the title (that seems different from what's in
the email) and leaving the other part of how to do it using a DI framework
to others.

Spark does not use any DI framework internally and wires components itself.
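
For the practical half of the question: since the key/value pairs already
live in HOCON, one common pattern (a sketch using the Typesafe Config API,
with a hypothetical "spark" section per client jar) is to fold the loaded
config into a SparkConf yourself rather than injecting it:

import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkConf
import scala.collection.JavaConverters._

// pick the per-client file with e.g. -Dconfig.resource=client-a.conf
val sparkSection = ConfigFactory.load().getConfig("spark")

val sparkConf = sparkSection.entrySet().asScala.foldLeft(new SparkConf()) {
  (conf, entry) =>
    conf.set(s"spark.${entry.getKey}", sparkSection.getString(entry.getKey))
}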

Jacek

On 2 Apr 2017 3:29 p.m., "kant kodali"  wrote:

> Hi All,
>
> I am wondering if I can get the SparkConf object through dependency
> injection. I currently use the HOCON library to store all the key/value
> pairs required to construct the SparkConf. The problem is that I have
> created multiple client jars (by client jars I mean the ones we supply to
> spark-submit to run our app), and each of them requires its own config. It
> would be nice to have the SparkConf created by the DI framework depending
> on which client jar we want to run. I am assuming someone must have done
> this?
>
> Thanks!
>


Benchmarking streaming frameworks

2017-04-03 Thread gvdongen
Dear users of Streaming Technologies,

As a PhD student in big data analytics, I am currently in the process of
compiling a list of benchmarks (to test multiple streaming frameworks) in
order to create an expanded benchmarking suite. The benchmark suite is being
developed as a part of my current work at Ghent University.

The included frameworks at this time are, in no particular order, Spark,
Flink, Kafka (Streams), Storm (Trident) and Drizzle. Any pointers to
previous work or relevant benchmarks would be appreciated.

Best regards,
Giselle van Dongen




