Re: Question about Spark and filesystems

2016-12-18 Thread vincent gromakowski
I am using Gluster and I have decent performance with basic maintenance
effort. An advantage of Gluster is that you can plug Alluxio on top to improve
performance, but I still need to validate that...

On Dec 18, 2016 8:50 PM,  wrote:

> Hello,
>
> We are trying out Spark for some file processing tasks.
>
> Since each Spark worker node needs to access the same files, we have
> tried using Hdfs. This worked, but there were some oddities making me a
> bit uneasy. For dependency hell reasons I compiled a modified Spark, and
> this version exhibited the odd behaviour with Hdfs. The problem might
> have nothing to do with Hdfs, but the situation made me curious about
> the alternatives.
>
> Now I'm wondering what kind of file system would be suitable for our
> deployment.
>
> - There won't be a great number of nodes. Maybe 10 or so.
>
> - The datasets won't be big by big-data standards (maybe a couple of
>   hundred GB)
>
> So maybe I could just use an NFS server, with a caching client?
> Or should I try Ceph, or Glusterfs?
>
> Does anyone have any experiences to share?
>
> --
> Joakim Verona
> joa...@verona.se
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


RE: PowerIterationClustering Benchmark

2016-12-18 Thread Mostafa Alaa Mohamed
Hi All,

I have the same issue with one compressed .tgz file of around 3 GB. I increased
the number of nodes without any effect on performance.


Best Regards,
Mostafa Alaa Mohamed,
Technical Expert Big Data,
M: +971506450787
Email: mohamedamost...@etisalat.ae

From: Lydia Ickler [mailto:ickle...@googlemail.com]
Sent: Friday, December 16, 2016 02:04 AM
To: user@spark.apache.org
Subject: PowerIterationClustering Benchmark

Hi all,

I have a question regarding the PowerIterationClusteringExample.
I have adjusted the code so that it reads a file via
sc.textFile("path/to/input"), which works fine.
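For reference, a minimal sketch of checking and forcing the input parallelism,
since a small file read this way can end up in only one or two partitions and
leave extra executors idle (the path and partition count are just placeholders):

val raw = sc.textFile("path/to/input", 512)               // explicit minPartitions hint
println(s"input partitions = ${raw.getNumPartitions}")    // verify the work can actually spread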

Now I want to benchmark the algorithm with different numbers of nodes to see
how well the implementation scales. As a testbed I have up to 32 nodes
available, each with 16 cores, running Spark 2.0.2 on YARN.
For my smallest input data set (16 MB) the runtime barely changes whether I use
1, 2, 4, 8, 16, or 32 nodes (always ~1.5 minutes). The same holds for my largest
data set (2.3 GB): the runtime stays around 1 hour whether I use 16 or 32 nodes.

I was expecting that, for example, doubling the number of nodes would shrink the
runtime.
As for setting up my cluster environment I tried different suggestions from 
this paper https://hal.inria.fr/hal-01347638v1/document

Has anyone experienced the same? Or does anyone have suggestions about what
might have gone wrong?

Thanks in advance!
Lydia





Re: Re: GraphFrame not init vertices when load edges

2016-12-18 Thread zjp_j...@163.com
I'm sorry, I didn't express myself clearly. Please refer to the quoted text
below.

Quoting from http://spark.apache.org/docs/latest/graphx-programming-guide.html:
" GraphLoader.edgeListFile provides a way to load a graph from a list of edges 
on disk. It parses an adjacency list of (source vertex ID, destination vertex 
ID) pairs of the following form, skipping comment lines that begin with #:
# This is a comment
2 1
4 1
1 2

It creates a Graph from the specified edges, automatically creating any 
vertices mentioned by edges."

Automatically creating any vertices mentioned by the edges when building a Graph
from the specified edges is, I think, a good approach. But the
GraphLoader.edgeListFile format does not allow setting an edge attribute in the
edge file, so I want to know whether GraphFrames has any plan for this, or a
better way.
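In the meantime, something like the following sketch is the manual version of
what I am asking GraphFrames to do automatically. It derives the vertex
DataFrame from the edge DataFrame e of my original mail (quoted below), so every
vertex mentioned by an edge exists and the edge attribute column is kept
(DataFrame.union is Spark 2.x; on 1.6 use unionAll):

import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

// build the vertices from the distinct src/dst ids of the edges
val v = e.select(col("src").as("id"))
  .union(e.select(col("dst").as("id")))
  .distinct()

val g = GraphFrame(v, e)   // edges keep their "relationship" column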
Thanks







zjp_j...@163.com
 
From: Felix Cheung
Date: 2016-12-19 12:57
To: zjp_j...@163.com; user
Subject: Re: GraphFrame not init vertices when load edges
Or this is a better link: 
http://graphframes.github.io/quick-start.html

_
From: Felix Cheung 
Sent: Sunday, December 18, 2016 8:46 PM
Subject: Re: GraphFrame not init vertices when load edges
To: , user 


Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md




From: zjp_j...@163.com 
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges 
 
Hi,
I found that GraphFrame does not initialize vertices by default when creating
edges. Is there any plan about this, or a better way?   Thanks
val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")



zjp_j...@163.com




Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
There is no GraphLoader for GraphFrames, but you can load with GraphX and
convert:


http://graphframes.github.io/user-guide.html#graphx-to-graphframe
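Roughly, as a sketch (it assumes the graphframes package is on the classpath;
the file path is made up, and GraphLoader's edge-list format still cannot carry
edge attributes):

import org.apache.spark.graphx.GraphLoader
import org.graphframes.GraphFrame

// GraphLoader creates the vertices automatically from the edge list
val gx = GraphLoader.edgeListFile(sc, "path/to/edge-list.txt")
// convert to a GraphFrame; the GraphX attributes end up in "attr" columns
val gf = GraphFrame.fromGraphX(gx)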






Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
Or this is a better link:
http://graphframes.github.io/quick-start.html

_
From: Felix Cheung >
Sent: Sunday, December 18, 2016 8:46 PM
Subject: Re: GraphFrame not init vertices when load edges
To: >, user 
>


Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md



From: zjp_j...@163.com 
>
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges

Hi,
I fond GraphFrame when create edges not init vertiecs by default, has any plan 
about it or better ways?   Thanks

val e = sqlContext.createDataFrame(List(  ("a", "b", "friend"),  ("b", "c", 
"follow"),  ("c", "b", "follow"),  ("f", "c", "follow"),  ("e", "f", "follow"), 
 ("e", "d", "friend"),  ("d", "a", "friend"),  ("a", "e", 
"friend"))).toDF("src", "dst", "relationship")


zjp_j...@163.com




Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md



From: zjp_j...@163.com 
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges

Hi,
I found that GraphFrame does not initialize vertices by default when creating
edges. Is there any plan about this, or a better way?   Thanks

val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")


zjp_j...@163.com


GraphFrame not init vertices when load edges

2016-12-18 Thread zjp_j...@163.com
Hi,
I found that GraphFrame does not initialize vertices by default when creating
edges. Is there any plan about this, or a better way?   Thanks
val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")



zjp_j...@163.com


Re: The spark hive udf can read broadcast the variables?

2016-12-18 Thread Takeshi Yamamuro
Hi,

No, you can't.
If you use a Scala UDF, you can do it like this:

import org.apache.spark.sql.functions.udf
import spark.implicits._   // for the 'id column syntax

val bv = sc.broadcast(100)
val testUdf = udf { (i: Long) => i + bv.value }
spark.range(10).select(testUdf('id)).show
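If it has to be callable from SQL (as a Hive UDF would be), a sketch of the same
idea via registration — assuming a SparkSession named spark; the function name
addBv is made up:

// register a Scala function that closes over the broadcast value, then call it from SQL
spark.udf.register("addBv", (i: Long) => i + bv.value)
spark.sql("SELECT addBv(id) FROM range(10)").show()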


// maropu


On Sun, Dec 18, 2016 at 12:24 AM, 李斌松  wrote:

> The spark hive udf can read broadcast the variables?
>



-- 
---
Takeshi Yamamuro


Re: How to get recent value in spark dataframe

2016-12-18 Thread Richard Xin
I am not sure I understood your logic, but it seems to me that you could take a
look at Hive's lead/lag window functions.
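Something along these lines might also work with Spark's own window functions —
an untested sketch that assumes Spark 2.0+ (for last(..., ignoreNulls)) and the
column names from your example, with df standing for your data frame:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// for every flag=0 row, take the price of the latest preceding flag=1 row
val w = Window.partitionBy("id").orderBy("date")
  .rowsBetween(Long.MinValue, -1)   // all rows strictly before the current one

val result = df.withColumn("new_column",
  when(col("flag") === 0,
    last(when(col("flag") === 1, col("price")), ignoreNulls = true).over(w)
  ).otherwise(lit(null)))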

On Monday, December 19, 2016 1:41 AM, Milin korath  wrote:

Thanks, I tried with a left outer join. My dataset has around 400M records and a
lot of shuffling is happening. Is there any other workaround apart from a join?
I tried to use a window function but I am not getting a proper solution.

Thanks

[Spark SQL] Task failed while writing rows

2016-12-18 Thread Joseph Naegele

Hi all,

I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 
1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). 
Its current compressed size is around 13 GB, but my problem started when it
was much smaller, maybe 5 GB. This dataset is generated by performing a query 
on an existing ORC dataset in HDFS, selecting a subset of the existing data 
(i.e. removing duplicates). When I write this dataset to HDFS using ORC I get 
the following exceptions in the driver:

org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.lang.RuntimeException: Failed to commit task
Suppressed: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 0 expected: 32
Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
This happens multiple times. The executors tell me the following a few times 
before the same exceptions as above:

2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer class
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block
BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

My HDFS datanode says:

2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE,
cliID: DFSClient_attempt_201612090102__m_25_0_956624542_193, offset: 0,
srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a,
blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 93026972
2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, type=LAST_IN_PIPELINE,
downstreams=0:[] terminating
2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation
src: /127.0.0.1:57790 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read.
ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:57790]

It looks like the datanode is receiving the block on multiple ports (threads?) 
and one of the sending connections terminates early.

I was originally running 6 executors with 6 cores and 24 GB RAM each (total: 36
cores, 144 GB) and experienced many of these issues, where occasionally my job
would fail altogether. Lowering the number of cores appears to reduce the
frequency of these errors; however, I'm now down to 4 executors with 2 cores
each (total: 8 cores), which is significantly less, and I still see approximately
1-3 task failures.

Details:
- Spark 1.6.3 - Standalone
- RDD compression enabled
- HDFS replication disabled
- Everything running on the same host
- Otherwise vanilla configs for Hadoop and Spark

Does anybody have any ideas or hints? I can't imagine the problem is solely 
related to the number of executor cores.

Thanks,
Joe Naegele


Question about Spark and filesystems

2016-12-18 Thread joakim
Hello,

We are trying out Spark for some file processing tasks.

Since each Spark worker node needs to access the same files, we have
tried using Hdfs. This worked, but there were some oddities making me a
bit uneasy. For dependency hell reasons I compiled a modified Spark, and
this version exhibited the odd behaviour with Hdfs. The problem might
have nothing to do with Hdfs, but the situation made me curious about
the alternatives.

Now I'm wondering what kind of file system would be suitable for our
deployment.

- There won't be a great number of nodes. Maybe 10 or so.

- The datasets won't be big by big-data standards (maybe a couple of
  hundred GB)

So maybe I could just use an NFS server, with a caching client?
Or should I try Ceph, or Glusterfs?

Does anyone have any experiences to share?
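As far as I understand, Spark itself only needs every worker to resolve the same
path, whether that is an HDFS URI or a filesystem mounted on all nodes — roughly
like this (the paths below are made up):

val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/input")
val fromMount = sc.textFile("file:///mnt/shared/data/input")  // must be mounted identically on every worker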

-- 
Joakim Verona
joa...@verona.se

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to get recent value in spark dataframe

2016-12-18 Thread Milin korath
Thanks, I tried with a left outer join. My dataset has around 400M records and a
lot of shuffling is happening. Is there any other workaround apart from a join?
I tried to use a window function but I am not getting a proper solution.


Thanks

On Sat, Dec 17, 2016 at 4:55 AM, Michael Armbrust 
wrote:

> Oh and to get the null for missing years, you'd need to do an outer join
> with a table containing all of the years you are interested in.
>
> On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust 
> wrote:
>
>> Are you looking for argmax? Here is an example.
>>
>> On Wed, Dec 14, 2016 at 8:49 PM, Milin korath 
>> wrote:
>>
>>> Hi
>>>
>>> I have a Spark data frame with the following structure:
>>>
>>>  id  flag  price  date
>>>   a   0     100    2015
>>>   a   0     50     2015
>>>   a   1     200    2014
>>>   a   1     300    2013
>>>   a   0     400    2012
>>>
>>> I need to create a data frame with the most recent flag-1 value filled into
>>> the flag-0 rows:
>>>
>>>   id  flag  price  date  new_column
>>>   a   0     100    2015  200
>>>   a   0     50     2015  200
>>>   a   1     200    2014  null
>>>   a   1     300    2013  null
>>>   a   0     400    2012  null
>>>
>>> We have 2 rows with flag=0. Consider the first row (flag=0): I have 2 values
>>> (200 and 300) and I take the most recent one, 200 (from 2014). For the last
>>> row there is no recent flag-1 value, so it is filled with null.
>>>
>>> Looking for a solution using Scala. Any help would be appreciated. Thanks
>>>
>>> Thanks
>>> Milin
>>>
>>
>>
>


How to get recent value in spark dataframe

2016-12-18 Thread milinkorath
I have a Spark data frame with the following structure:

 id  flag  price  date
  a   0     100    2015
  a   0     50     2015
  a   1     200    2014
  a   1     300    2013
  a   0     400    2012
I need to create a data frame with the most recent flag-1 value filled into the
flag-0 rows:

  id  flag  price  date  new_column
  a   0     100    2015  200
  a   0     50     2015  200
  a   1     200    2014  null
  a   1     300    2013  null
  a   0     400    2012  null
We have 2 rows with flag=0. Consider the first row (flag=0): I have 2 values
(200 and 300) and I take the most recent one, 200 (from 2014). For the last row
there is no recent flag-1 value, so it is filled with null.

I found a solution with a left join. My dataset has around 400M records and the
join causes a lot of shuffling. Is there any better way to find the recent value?


Looking for a solution using Scala. Any help would be appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-recent-value-in-spark-dataframe-tp28230.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org