Re: Should spark-ec2 get its own repo?

2015-07-31 Thread Patrick Wendell
Hey All,

I've mostly kept quiet since I am not very active in maintaining this
code anymore. However, it is a bit odd that the project is
split-brained with a lot of the code being on github and some in the
Spark repo.

If the consensus is to migrate everything to github, that seems okay
with me. I would vouch for preserving user continuity, for instance still
having a shim ec2/spark-ec2 script that could perhaps just download
and unpack the real script from github.
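
For what it's worth, a rough Scala sketch of what such a shim could do (purely
illustrative; the tarball URL and entry-point path below are assumptions, based
on the scripts moving to github.com/amplab/spark-ec2 as discussed later in this
thread):

import java.io.File
import java.net.URL
import scala.sys.process._

// Hypothetical shim: fetch a tarball of the relocated spark-ec2 repo, unpack it,
// and forward the user's arguments to the real script.
object SparkEc2Shim {
  def main(args: Array[String]): Unit = {
    val tarball = "https://github.com/amplab/spark-ec2/archive/master.tar.gz" // assumed URL
    (new URL(tarball) #> new File("spark-ec2.tar.gz")).!  // download
    "tar xzf spark-ec2.tar.gz".!                          // unpack
    (Seq("spark-ec2-master/spark-ec2") ++ args).!         // hand off to the real spark-ec2
  }
}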

- Patrick

On Fri, Jul 31, 2015 at 2:13 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
 Yes - It is still in progress, but I have just not gotten time to get to
 this. I think getting the repo references in the codebase moved from mesos to
 amplab by 1.5 should be possible.

 Thanks
 Shivaram

 On Fri, Jul 31, 2015 at 3:08 AM, Sean Owen so...@cloudera.com wrote:

 PS: is this still in progress? It feels like something that would be
 good to do before 1.5.0, if it's going to happen soon.

 On Wed, Jul 22, 2015 at 6:59 AM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  Yeah I'll send a note to the mesos dev list just to make sure they are
  informed.
 
  Shivaram
 
  On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote:
 
  I agree it's worth informing Mesos devs and checking that there are no
  big objections. I presume Shivaram is plugged in enough to Mesos that
  there won't be any surprises there, and that the project would also
   agree with moving this Spark-specific bit out. They may also want to
  leave a pointer to the new location in the mesos repo of course.
 
  I don't think it is something that requires a formal vote. It's not a
  question of ownership -- neither Apache nor the project PMC owns the
  code. I don't think it's different from retiring or removing any other
  code.
 
 
 
 
 
  On Tue, Jul 21, 2015 at 7:03 PM, Mridul Muralidharan mri...@gmail.com
  wrote:
    If I am not wrong, since the code was hosted within the mesos project
    repo, I assume (at least part of it) is owned by the mesos project and so
    its PMC?
  
   - Mridul
  
   On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
   shiva...@eecs.berkeley.edu wrote:
    There is technically no PMC for the spark-ec2 project (I guess we are kind
    of establishing one right now). I haven't heard anything from the Spark PMC
    on the dev list that might suggest a need for a vote so far. I will send
    another round of email notification to the dev list when we have a JIRA / PR
    that actually moves the scripts (right now the only thing that changed is
    the location of some scripts in mesos/ to amplab/).
  
   Thanks
   Shivaram
  
 
 






Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread Josh Rosen
It would also be great to test this with codegen and unsafe enabled, while
continuing to use the sort shuffle manager instead of the new
tungsten-sort one.
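
Concretely, that combination could be set along these lines (a sketch only; the
property names are the ones quoted elsewhere in this thread, with the shuffle
manager switched back to the plain sort-based one):

import org.apache.spark.SparkConf

// Codegen and unsafe stay on, but keep the existing sort shuffle manager rather
// than the new tungsten-sort one.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")
  .set("spark.sql.codegen", "true")
  .set("spark.sql.unsafe.enabled", "true")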

On Fri, Jul 31, 2015 at 1:39 AM, Reynold Xin r...@databricks.com wrote:

 Is this deterministically reproducible? Can you try this on the latest
 master branch?

 Would be great to turn on debug logging and dump the generated code. Also
 would be great to dump the array size at your line 314 in UnsafeRow (and
 whatever the master branch's appropriate line is).

 On Fri, Jul 31, 2015 at 1:31 AM, james yiaz...@gmail.com wrote:

 Another error:
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode1:40443
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMaster: Size of output
 statuses
 for shuffle 3 is 583 bytes
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode1:40474
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode2:34052
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode3:46929
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode3:50890
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode2:47795
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode4:35120
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 32.0 in stage
 151.0 (TID 1203) in 155 ms on bignode3 (1/50)
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 35.0 in stage
 151.0 (TID 1204) in 157 ms on bignode2 (2/50)
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 8.0 in stage
 151.0 (TID 1196) in 168 ms on bignode3 (3/50)
 15/07/31 16:15:28 WARN scheduler.TaskSetManager: Lost task 46.0 in stage
 151.0 (TID 1184, bignode1): java.lang.NegativeArraySizeException
 at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:314)
 at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getUTF8String(UnsafeRow.java:297)
 at SC$SpecificProjection.apply(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:152)
 at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:140)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:148)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:86)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)










Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-31 Thread Justin Uang
Sweet! It's here:
https://issues.apache.org/jira/browse/SPARK-9141?focusedCommentId=14649437&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14649437

On Tue, Jul 28, 2015 at 11:21 PM Michael Armbrust mich...@databricks.com
wrote:

 Can you add your description of the problem as a comment to that ticket?
 We'll make sure to test both cases and break it out if the root cause
 ends up being different.

 On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang justin.u...@gmail.com
 wrote:

 Sweet! Does this cover DataFrame#rdd also using the cached query from
 DataFrame#cache? I think ticket 9141 is mainly concerned with whether a
 derived DataFrame (B) of a cached DataFrame (A) uses the cached query of A,
 not whether the rdd from A.rdd or B.rdd uses the cached query of A.

 On Tue, Jul 28, 2015 at 11:33 PM Joseph Bradley jos...@databricks.com
 wrote:

 Thanks for bringing this up!  I talked with Michael Armbrust, and it
 sounds like this is from a bug in DataFrame caching:
 https://issues.apache.org/jira/browse/SPARK-9141
 It's marked as a blocker for 1.5.
 Joseph

 On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang justin.u...@gmail.com
 wrote:

 Hey guys,

 I'm running into some pretty bad performance issues when it comes to
 using a CrossValidator, because of the caching behavior of DataFrames.

 The root of the problem is that while I have cached my DataFrame
 representing the features and labels, it is caching at the DataFrame level,
 while CrossValidator/LogisticRegression both drop down to the dataset.rdd
 level, which ignores the caching that I have previously done. This is
 worsened by the fact that for each combination of a fold and a param set
 from the grid, it recomputes my entire input dataset because the caching
 was lost.
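
 A tiny sketch of the pattern being described (assuming a SQLContext named
 sqlContext; the exact behaviour depends on the Spark version):

 val df = sqlContext.range(0, 1000000).cache()
 df.count()      // materialises the DataFrame-level cache
 df.rdd.count()  // drops to DataFrame#rdd, which may recompute the query
                 // instead of reading the cached data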

 My current solution is to force the input DataFrame to be based off of
 a cached RDD, which I did with this horrible hack (I had to drop down to Java
 from PySpark because of something to do with vectors not being inferred
 correctly):

 def checkpoint_dataframe_caching(df):
     return DataFrame(
         sqlContext._ssql_ctx.createDataFrame(df._jdf.rdd().cache(), df._jdf.schema()),
         sqlContext)

 before I pass it into the CrossValidator.fit(). If I do this, I still
 have to cache the underlying rdd once more than necessary (in addition to
 DataFrame#cache()), but at least in cross validation, it doesn't recompute
 the RDD graph anymore.

 Note that input_df.rdd.cache() doesn't work because the Python
 CrossValidator implementation applies some more DataFrame transformations,
 like filter, which then causes filtered_df.rdd to return a completely
 different RDD that recomputes the entire graph.

 Is it the intention of Spark SQL that calling DataFrame#rdd removes any
 caching that was done for the query? Is the fix as simple as getting
 DataFrame#rdd to reference the cached query, or is there something more
 subtle going on?

 Best,

 Justin






Re: FrequentItems in spark-sql-execution-stat

2015-07-31 Thread Koert Kuipers
This looks like a mistake in FrequentItems to me. If the map is full
(map.size == size) then it should still add the new item (after removing
items from the map and decrementing counts).

If it's not a mistake, then at least it looks to me like the algorithm is
different from the one described in the paper. Is this maybe on purpose?
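
For reference, a minimal sketch of the decrement-and-insert behaviour described
above (a plain weighted Misra-Gries style counter written from that description;
this is not the Spark implementation):

import scala.collection.mutable

class FreqItemCounterSketch[A](size: Int) {
  val baseMap = mutable.Map.empty[A, Long]

  def add(key: A, count: Long): this.type = {
    if (baseMap.contains(key)) {
      baseMap(key) += count
    } else if (baseMap.size < size) {
      baseMap(key) = count
    } else {
      // Summary is full: decrement every counter (and the incoming count) by the
      // smallest counter involved, drop counters that reach zero, and keep
      // whatever remains of the new key.
      val dec = math.min(count, baseMap.values.min)
      baseMap.keys.toList.foreach { k =>
        baseMap(k) -= dec
        if (baseMap(k) <= 0) baseMap.remove(k)
      }
      if (count - dec > 0) baseMap(key) = count - dec
    }
    this
  }
}

On the example in the quoted mail below (size 3, counters 1 -> 3, 2 -> 3, 3 -> 4,
then adding key 4 with count 25), this leaves Map(3 -> 1, 4 -> 22) rather than
dropping the new key.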

On Thu, Jul 30, 2015 at 4:26 PM, Yucheng yl2...@nyu.edu wrote:

 Hi all,

 I'm reading the code in spark-sql-execution-stat-FrequentItems.scala, and
 I'm a little confused about the add method in the FreqItemCounter class.
 Please see the link here:

 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala

 My question is: when the baseMap does not contain the key and the size of
 the baseMap is not less than size, why should we keep only the key/value
 pairs whose value is greater than count?

 Just like this example:
 Now the baseMap is Map(1 -> 3, 2 -> 3, 3 -> 4), and the size is 3. I want to
 add Map(4 -> 25) into this baseMap, so it will retain the key/values whose
 value is greater than 25, and in that way, the baseMap will be empty.
 However, I think we should at least add 4 -> 25 into the baseMap. Could
 anybody help me with this problem?

 Best,
 Yucheng







Re: Spark CBO

2015-07-31 Thread Olivier Girardot
Hi,
there is one cost-based optimization implemented in Spark SQL, if I'm not
mistaken, regarding the join operations:
if one side of the join is a small enough dataset, then Spark SQL's
strategy will be to automatically broadcast the small dataset instead of
shuffling.
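
For completeness, the size cutoff for that automatic broadcast is configurable; a
minimal illustration (assuming a SQLContext named sqlContext):

// spark.sql.autoBroadcastJoinThreshold is in bytes; setting it to -1 disables
// the automatic broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)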

I guess you have something else on your mind?

Regards,

Olivier.

2015-07-31 8:38 GMT+02:00 burakkk burak.isi...@gmail.com:

 Hi everyone,
 I'm wondering whether there is any plan to implement a cost-based optimizer
 for Spark SQL?

 Best regards...

 --

 *BURAK ISIKLI* | http://burakisikli.wordpress.com




-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94


Re: Should spark-ec2 get its own repo?

2015-07-31 Thread Shivaram Venkataraman
Yes - It is still in progress, but I have just not gotten time to get to
this. I think getting the repo references in the codebase moved from mesos to
amplab by 1.5 should be possible.

Thanks
Shivaram

On Fri, Jul 31, 2015 at 3:08 AM, Sean Owen so...@cloudera.com wrote:

 PS: is this still in progress? It feels like something that would be
 good to do before 1.5.0, if it's going to happen soon.

 On Wed, Jul 22, 2015 at 6:59 AM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  Yeah I'll send a note to the mesos dev list just to make sure they are
  informed.
 
  Shivaram
 
  On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote:
 
  I agree it's worth informing Mesos devs and checking that there are no
  big objections. I presume Shivaram is plugged in enough to Mesos that
  there won't be any surprises there, and that the project would also
   agree with moving this Spark-specific bit out. They may also want to
  leave a pointer to the new location in the mesos repo of course.
 
  I don't think it is something that requires a formal vote. It's not a
  question of ownership -- neither Apache nor the project PMC owns the
  code. I don't think it's different from retiring or removing any other
  code.
 
 
 
 
 
  On Tue, Jul 21, 2015 at 7:03 PM, Mridul Muralidharan mri...@gmail.com
  wrote:
    If I am not wrong, since the code was hosted within the mesos project
    repo, I assume (at least part of it) is owned by the mesos project and so
    its PMC?
  
   - Mridul
  
   On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
   shiva...@eecs.berkeley.edu wrote:
    There is technically no PMC for the spark-ec2 project (I guess we are kind
    of establishing one right now). I haven't heard anything from the Spark PMC
    on the dev list that might suggest a need for a vote so far. I will send
    another round of email notification to the dev list when we have a JIRA / PR
    that actually moves the scripts (right now the only thing that changed is
    the location of some scripts in mesos/ to amplab/).
  
   Thanks
   Shivaram
  
 
 



Spark CBO

2015-07-31 Thread burakkk
Hi everyone,
I'm wondering whether there is any plan to implement a cost-based optimizer
for Spark SQL?

Best regards...

-- 

*BURAK ISIKLI* | http://burakisikli.wordpress.com


Came across Spark SQL hang issue with Spark 1.5 Tungsten feature

2015-07-31 Thread james
I tried to enable Tungsten with Spark SQL and set the 3 parameters below, but I
found that Spark SQL always hangs at the point below. Could you please point me
to the potential cause? I'd appreciate any input.
spark.shuffle.manager=tungsten-sort
spark.sql.codegen=true
spark.sql.unsafe.enabled=true

15/07/31 15:19:46 INFO scheduler.TaskSetManager: Starting task 110.0 in
stage 131.0 (TID 280, bignode3, PROCESS_LOCAL, 1446 bytes)
15/07/31 15:19:46 INFO scheduler.TaskSetManager: Starting task 111.0 in
stage 131.0 (TID 281, bignode2, PROCESS_LOCAL, 1446 bytes)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0
in memory on bignode3:38948 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0
in memory on bignode3:57341 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0
in memory on bignode1:33229 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0
in memory on bignode1:42261 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0
in memory on bignode2:44033 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0
in memory on bignode2:42863 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0
in memory on bignode4:58639 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode3:46462
15/07/31 15:19:46 INFO spark.MapOutputTrackerMaster: Size of output statuses
for shuffle 3 is 71847 bytes
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode3:38803
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode1:35241
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode1:48323
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode2:56697
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode4:55810
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode2:37386







New Feature Request

2015-07-31 Thread Sandeep Giri
Dear Spark Dev Community,

I am wondering if there is already a function to solve my problem. If not,
then should I work on this?

Say you just want to check if a word exists in a huge text file. I could
not find better ways than those mentioned here
http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6
.

So, I was proposing that we have a function called *exists* in RDD with the
following signature:

#returns true if n elements exist which qualify our criteria.
#the qualifying function would receive the element and its index and return
#true or false.
def *exists*(qualifying_function, n):
 


Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com
Phone: +1-253-397-1945 (Office)

LinkedIn: https://linkedin.com/company/knowbigdata | Web: http://knowbigdata.com
Facebook: https://facebook.com/knowbigdata | Twitter: https://twitter.com/IKnowBigData


Re: New Feature Request

2015-07-31 Thread Carsten Schnober
Hi,
the RDD class does not have an exists() method (in the Scala API), but
the functionality you need seems easy to reproduce with the existing methods:

val containsNMatchingElements =
  data.filter(qualifying_function).take(n).length >= n

Note: I am not sure whether the intermediate take(n) really increases
performance, but the idea is to arbitrarily reduce the number of
elements in the RDD before counting because we are not interested in the
full count.

If you need to check specifically whether there is at least one matching
occurrence, it is probably preferable to use isEmpty() instead of
count() and check whether the result is false:

val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())

Best,
Carsten



Am 31.07.2015 um 11:11 schrieb Sandeep Giri:
 Dear Spark Dev Community,
 
 I am wondering if there is already a function to solve my problem. If
 not, then should I work on this?
 
 Say you just want to check if a word exists in a huge text file. I could
 not find better ways than those mentioned here
 http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6. 
 
 So, I was proposing if we have a function called /exists /in RDD with
 the following signature:
 
 #returns the true if n elements exist which qualify our criteria.
 #qualifying function would receive the element and its index and return
 true or false. 
 def /exists/(qualifying_function, n):
  
 
 
 Regards,
 Sandeep Giri,
 +1 347 781 4573 (US)
 +91-953-899-8962 (IN)
 
 www.KnowBigData.com
 Phone: +1-253-397-1945 (Office)

 LinkedIn: https://linkedin.com/company/knowbigdata | Web: http://knowbigdata.com
 Facebook: https://facebook.com/knowbigdata | Twitter: https://twitter.com/IKnowBigData
 

-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de




Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread james
Another error:
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode1:40443
15/07/31 16:15:28 INFO spark.MapOutputTrackerMaster: Size of output statuses
for shuffle 3 is 583 bytes
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode1:40474
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode2:34052
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode3:46929
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode3:50890
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode2:47795
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
map output locations for shuffle 3 to bignode4:35120
15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 32.0 in stage
151.0 (TID 1203) in 155 ms on bignode3 (1/50)
15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 35.0 in stage
151.0 (TID 1204) in 157 ms on bignode2 (2/50)
15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 8.0 in stage
151.0 (TID 1196) in 168 ms on bignode3 (3/50)
15/07/31 16:15:28 WARN scheduler.TaskSetManager: Lost task 46.0 in stage
151.0 (TID 1184, bignode1): java.lang.NegativeArraySizeException
at
org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:314)
at
org.apache.spark.sql.catalyst.expressions.UnsafeRow.getUTF8String(UnsafeRow.java:297)
at SC$SpecificProjection.apply(Unknown Source)
at
org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:152)
at
org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:140)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:148)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)








Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread Reynold Xin
Is this deterministically reproducible? Can you try this on the latest
master branch?

Would be great to turn on debug logging and dump the generated code. Also
would be great to dump the array size at your line 314 in UnsafeRow (and
whatever the master branch's appropriate line is).

On Fri, Jul 31, 2015 at 1:31 AM, james yiaz...@gmail.com wrote:

 Another error:
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode1:40443
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMaster: Size of output
 statuses
 for shuffle 3 is 583 bytes
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode1:40474
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode2:34052
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode3:46929
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode3:50890
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode2:47795
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode4:35120
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 32.0 in stage
 151.0 (TID 1203) in 155 ms on bignode3 (1/50)
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 35.0 in stage
 151.0 (TID 1204) in 157 ms on bignode2 (2/50)
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 8.0 in stage
 151.0 (TID 1196) in 168 ms on bignode3 (3/50)
 15/07/31 16:15:28 WARN scheduler.TaskSetManager: Lost task 46.0 in stage
 151.0 (TID 1184, bignode1): java.lang.NegativeArraySizeException
 at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:314)
 at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getUTF8String(UnsafeRow.java:297)
 at SC$SpecificProjection.apply(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:152)
 at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:140)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:148)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:86)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)









Re: New Feature Request

2015-07-31 Thread Jonathan Winandy
Hello!

You could try something like this:

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
  rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n
}

It would work for large datasets and large values of n.
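
For instance, the word-in-a-huge-text-file check from the original mail could
then be written roughly as follows (the input path is hypothetical, and sc is
assumed to be an existing SparkContext):

// Checks whether the word "spark" occurs at least once in the (hypothetical) file.
val words = sc.textFile("hdfs:///data/huge.txt").flatMap(_.split("\\s+"))
val found = exists(words)(_ == "spark", n = 0)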

Have a nice day,

Jonathan



On 31 July 2015 at 11:29, Carsten Schnober 
schno...@ukp.informatik.tu-darmstadt.de wrote:

 Hi,
 the RDD class does not have an exists() method (in the Scala API), but
 the functionality you need seems easy to reproduce with the existing
 methods:

 val containsNMatchingElements =
   data.filter(qualifying_function).take(n).length >= n

 Note: I am not sure whether the intermediate take(n) really increases
 performance, but the idea is to arbitrarily reduce the number of
 elements in the RDD before counting because we are not interested in the
 full count.

 If you need to check specifically whether there is at least one matching
 occurrence, it is probably preferable to use isEmpty() instead of
 count() and check whether the result is false:

 val contains1MatchingElement =
 !(data.filter(qualifying_function).isEmpty())

 Best,
 Carsten



 Am 31.07.2015 um 11:11 schrieb Sandeep Giri:
  Dear Spark Dev Community,
 
  I am wondering if there is already a function to solve my problem. If
  not, then should I work on this?
 
  Say you just want to check if a word exists in a huge text file. I could
  not find better ways than those mentioned here
  
 http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6
 .
 
  So, I was proposing if we have a function called /exists /in RDD with
  the following signature:
 
  #returns the true if n elements exist which qualify our criteria.
  #qualifying function would receive the element and its index and return
  true or false.
  def /exists/(qualifying_function, n):
   
 
 
  Regards,
  Sandeep Giri,
  +1 347 781 4573 (US)
  +91-953-899-8962 (IN)
 
  www.KnowBigData.com
  Phone: +1-253-397-1945 (Office)

  LinkedIn: https://linkedin.com/company/knowbigdata | Web: http://knowbigdata.com
  Facebook: https://facebook.com/knowbigdata | Twitter: https://twitter.com/IKnowBigData
 

 --
 Carsten Schnober
 Doctoral Researcher
 Ubiquitous Knowledge Processing (UKP) Lab
 FB 20 / Computer Science Department
 Technische Universität Darmstadt
 Hochschulstr. 10, D-64289 Darmstadt, Germany
 phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
 schno...@ukp.informatik.tu-darmstadt.de
 www.ukp.tu-darmstadt.de

 Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
 GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
 (AIPHES): www.aiphes.tu-darmstadt.de
 PhD program: Knowledge Discovery in Scientific Literature (KDSL)
 www.kdsl.tu-darmstadt.de





Re: Should spark-ec2 get its own repo?

2015-07-31 Thread Sean Owen
PS: is this still in progress? It feels like something that would be
good to do before 1.5.0, if it's going to happen soon.

On Wed, Jul 22, 2015 at 6:59 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
 Yeah I'll send a note to the mesos dev list just to make sure they are
 informed.

 Shivaram

 On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote:

 I agree it's worth informing Mesos devs and checking that there are no
 big objections. I presume Shivaram is plugged in enough to Mesos that
 there won't be any surprises there, and that the project would also
  agree with moving this Spark-specific bit out. They may also want to
 leave a pointer to the new location in the mesos repo of course.

 I don't think it is something that requires a formal vote. It's not a
 question of ownership -- neither Apache nor the project PMC owns the
 code. I don't think it's different from retiring or removing any other
 code.





 On Tue, Jul 21, 2015 at 7:03 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
   If I am not wrong, since the code was hosted within the mesos project
   repo, I assume (at least part of it) is owned by the mesos project and so
   its PMC?
 
  - Mridul
 
  On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
   There is technically no PMC for the spark-ec2 project (I guess we are kind
   of establishing one right now). I haven't heard anything from the Spark PMC
   on the dev list that might suggest a need for a vote so far. I will send
   another round of email notification to the dev list when we have a JIRA / PR
   that actually moves the scripts (right now the only thing that changed is
   the location of some scripts in mesos/ to amplab/).
 
  Thanks
  Shivaram
 


