Re: Should spark-ec2 get its own repo?
Hey all, I've mostly kept quiet since I am not very active in maintaining this code anymore. However, it is a bit odd that the project is split-brained, with a lot of the code on GitHub and some in the Spark repo. If the consensus is to migrate everything to GitHub, that seems okay with me. I would advocate for preserving user continuity, for instance by keeping a shim ec2/spark-ec2 script that could perhaps just download and unpack the real script from GitHub. - Patrick

On Fri, Jul 31, 2015 at 2:13 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
Yes - it is still in progress, but I have just not gotten time to get to this. I think getting the repo moved from mesos to amplab in the codebase by 1.5 should be possible. Thanks Shivaram

On Fri, Jul 31, 2015 at 3:08 AM, Sean Owen so...@cloudera.com wrote:
PS: is this still in progress? It feels like something that would be good to do before 1.5.0, if it's going to happen soon.

On Wed, Jul 22, 2015 at 6:59 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
Yeah, I'll send a note to the mesos dev list just to make sure they are informed. Shivaram

On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote:
I agree it's worth informing Mesos devs and checking that there are no big objections. I presume Shivaram is plugged in enough to Mesos that there won't be any surprises there, and that the project would also agree with moving this Spark-specific bit out. They may also want to leave a pointer to the new location in the mesos repo, of course. I don't think it is something that requires a formal vote. It's not a question of ownership -- neither Apache nor the project PMC owns the code. I don't think it's different from retiring or removing any other code.

On Tue, Jul 21, 2015 at 7:03 PM, Mridul Muralidharan mri...@gmail.com wrote:
If I am not wrong, since the code was hosted within the mesos project repo, I assume (at least part of it) is owned by the mesos project and so its PMC? - Mridul

On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
There is technically no PMC for the spark-ec2 project (I guess we are kind of establishing one right now). I haven't heard anything from the Spark PMC on the dev list that might suggest a need for a vote so far. I will send another round of email notification to the dev list when we have a JIRA / PR that actually moves the scripts (right now the only thing that changed is the location of some scripts from mesos/ to amplab/). Thanks Shivaram
Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature
It would also be great to test this with codegen and unsafe enabled, while continuing to use the sort shuffle manager instead of the new tungsten-sort one.

On Fri, Jul 31, 2015 at 1:39 AM, Reynold Xin r...@databricks.com wrote:
Is this deterministically reproducible? Can you try this on the latest master branch? Would be great to turn on debug logging and dump the generated code. Also would be great to dump the array size at your line 314 in UnsafeRow (and whatever the corresponding line is on master).

On Fri, Jul 31, 2015 at 1:31 AM, james yiaz...@gmail.com wrote:
Another error:
15/07/31 16:15:28 WARN scheduler.TaskSetManager: Lost task 46.0 in stage 151.0 (TID 1184, bignode1): java.lang.NegativeArraySizeException
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:314)
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getUTF8String(UnsafeRow.java:297)
...
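For the suggestion at the top of this message -- codegen and unsafe on, but the plain sort shuffle manager instead of tungsten-sort -- a minimal sketch of the configuration, assuming a Spark 1.5-era build; the property names are the ones quoted in this thread, so adjust them if your build differs:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Keep the existing sort-based shuffle, but still exercise codegen and unsafe.
val conf = new SparkConf()
  .setAppName("tungsten-repro")
  .set("spark.shuffle.manager", "sort")       // instead of tungsten-sort
  .set("spark.sql.codegen", "true")
  .set("spark.sql.unsafe.enabled", "true")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// ... re-run the failing query against sqlContext here ...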
Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator
Sweet! It's here: https://issues.apache.org/jira/browse/SPARK-9141?focusedCommentId=14649437&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14649437

On Tue, Jul 28, 2015 at 11:21 PM Michael Armbrust mich...@databricks.com wrote:
Can you add your description of the problem as a comment to that ticket and we'll make sure to test both cases and break it out if the root cause ends up being different.

On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang justin.u...@gmail.com wrote:
Sweet! Does this cover DataFrame#rdd also using the cached query from DataFrame#cache? I think ticket 9141 is mainly concerned with whether a derived DataFrame (B) of a cached DataFrame (A) uses the cached query of A, not whether the rdd from A.rdd or B.rdd uses the cached query of A.

On Tue, Jul 28, 2015 at 11:33 PM Joseph Bradley jos...@databricks.com wrote:
Thanks for bringing this up! I talked with Michael Armbrust, and it sounds like this is from a bug in DataFrame caching: https://issues.apache.org/jira/browse/SPARK-9141 It's marked as a blocker for 1.5. Joseph

On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang justin.u...@gmail.com wrote:
Hey guys, I'm running into some pretty bad performance issues when it comes to using a CrossValidator, because of the caching behavior of DataFrames. The root of the problem is that while I have cached my DataFrame representing the features and labels, the caching is at the DataFrame level, while CrossValidator/LogisticRegression both drop down to the dataset.rdd level, which ignores the caching that I have previously done. This is worsened by the fact that for each combination of a fold and a param set from the grid, it recomputes my entire input dataset because the caching was lost.

My current solution is to force the input DataFrame to be based off of a cached RDD, which I did with this horrible hack (I had to drop down to Java from PySpark because of something to do with vectors not being inferred correctly):

def checkpoint_dataframe_caching(df):
    return DataFrame(
        sqlContext._ssql_ctx.createDataFrame(df._jdf.rdd().cache(), df._jdf.schema()),
        sqlContext)

before I pass it into CrossValidator.fit(). If I do this, I still have to cache the underlying rdd once more than necessary (in addition to DataFrame#cache()), but at least in cross validation it doesn't recompute the RDD graph anymore. Note that input_df.rdd.cache() doesn't work, because the Python CrossValidator implementation applies some more DataFrame transformations like filter, which then causes filtered_df.rdd to return a completely different rdd that recomputes the entire graph.

Is it the intention of Spark SQL that calling DataFrame#rdd removes any caching that was done for the query? Is the fix as simple as getting DataFrame#rdd to reference the cached query, or is there something more subtle going on? Best, Justin
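For readers on the Scala side, a rough equivalent of the workaround above -- only a sketch, assuming a SQLContext named sqlContext is in scope and that rebuilding the DataFrame from an explicitly cached RDD[Row] is acceptable for your schema:

import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: pin the row RDD in the block store and rebuild the
// DataFrame on top of it, so that later .rdd calls reuse the cached rows.
def checkpointDataFrameCaching(df: DataFrame, sqlContext: SQLContext): DataFrame = {
  val cachedRows = df.rdd.cache()                     // RDD[Row], cached explicitly
  sqlContext.createDataFrame(cachedRows, df.schema)   // same schema, new lineage
}

As with the Python version, this caches the data a second time if DataFrame#cache() is also used, so it is only a stopgap until SPARK-9141 is resolved.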
Re: FrequentItems in spark-sql-execution-stat
This looks like a mistake in FrequentItems to me. If the map is full (map.size == size), then it should still add the new item (after removing items from the map and decrementing counts). If it's not a mistake, then at least it looks to me like the algorithm is different from the one described in the paper. Is this maybe on purpose?

On Thu, Jul 30, 2015 at 4:26 PM, Yucheng yl2...@nyu.edu wrote:
Hi all, I'm reading the code in sql/execution/stat/FrequentItems.scala, and I'm a little confused about the add method in the FreqItemCounter class. Please see the link here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala

My question is: when the baseMap does not contain the key, and the size of the baseMap is not less than size, why should we keep only the key/value pairs whose value is greater than count? Take this example: the baseMap is currently Map(1 -> 3, 2 -> 3, 3 -> 4) and size is 3. I want to add Map(4 -> 25) into this baseMap, so it will retain only the key/values whose value is greater than 25, and in that way the baseMap will be empty. However, I think we should at least add 4 -> 25 into the baseMap. Could anybody help me with this problem? Best, Yucheng
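For reference, a rough sketch of the behavior being described above -- decrement the existing counters and still admit the new item when the map is full -- written as a standalone counter. This is only an illustration of the frequent-items scheme as I read it, not the code currently in FrequentItems.scala, and the exact bookkeeping in the cited paper may differ:

import scala.collection.mutable

// Illustrative only: a frequent-items counter with at most `size` slots.
class NaiveFreqItemCounter(size: Int) {
  val baseMap: mutable.Map[Any, Long] = mutable.Map.empty

  def add(key: Any, count: Long): this.type = {
    if (baseMap.contains(key)) {
      baseMap(key) += count
    } else if (baseMap.size < size) {
      baseMap += key -> count
    } else {
      // Decrement every counter by the smaller of `count` and the current minimum,
      // evict anything that drops to zero, then insert the new key with the remainder.
      val decrement = math.min(count, baseMap.values.min)
      baseMap.transform((_, v) => v - decrement)
      baseMap.retain((_, v) => v > 0)
      if (count > decrement) baseMap += key -> (count - decrement)
    }
    this
  }
}

With Yucheng's example, this sketch would leave Map(3 -> 1, 4 -> 22) rather than an empty map.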
Re: Spark CBO
Hi, if I'm not mistaken there is already one piece of cost-based optimization implemented in Spark SQL, around join operations: if one side of a join is small enough, Spark SQL's strategy is to automatically broadcast the small dataset instead of shuffling it. I guess you have something else in mind?

Regards, Olivier.

2015-07-31 8:38 GMT+02:00 burakkk burak.isi...@gmail.com:
Hi everyone, I'm wondering whether there is any plan to implement a cost-based optimizer for Spark SQL? Best regards...
-- BURAK ISIKLI | http://burakisikli.wordpress.com

-- Olivier Girardot | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94
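A small illustration of that existing behavior, assuming a Spark 1.5-era sqlContext and two hypothetical registered tables; the automatic case is driven by spark.sql.autoBroadcastJoinThreshold, while the broadcast() hint forces it regardless of the estimated size:

import org.apache.spark.sql.functions.broadcast

// Automatic: tables whose estimated size is below the threshold (bytes) are
// broadcast to every executor instead of being shuffled.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

val orders = sqlContext.table("orders")        // hypothetical large fact table
val countries = sqlContext.table("countries")  // hypothetical small dimension table

// Explicit hint: force the small side to be broadcast.
val joined = orders.join(broadcast(countries), "country_code")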
Re: Should spark-ec2 get its own repo?
Yes - it is still in progress, but I have just not gotten time to get to this. I think getting the repo moved from mesos to amplab in the codebase by 1.5 should be possible. Thanks Shivaram

On Fri, Jul 31, 2015 at 3:08 AM, Sean Owen so...@cloudera.com wrote:
PS: is this still in progress? It feels like something that would be good to do before 1.5.0, if it's going to happen soon.
...
Spark CBO
Hi everyone, I'm wondering whether there is any plan to implement a cost-based optimizer for Spark SQL? Best regards...
-- BURAK ISIKLI | http://burakisikli.wordpress.com
Came across Spark SQL hang issue with Spark 1.5 Tungsten feature
I tried to enable Tungsten with Spark SQL by setting the three parameters below, but I found that Spark SQL always hangs at the point shown below. Could you please point me to the potential cause? I'd appreciate any input.

spark.shuffle.manager=tungsten-sort
spark.sql.codegen=true
spark.sql.unsafe.enabled=true

15/07/31 15:19:46 INFO scheduler.TaskSetManager: Starting task 110.0 in stage 131.0 (TID 280, bignode3, PROCESS_LOCAL, 1446 bytes)
15/07/31 15:19:46 INFO scheduler.TaskSetManager: Starting task 111.0 in stage 131.0 (TID 281, bignode2, PROCESS_LOCAL, 1446 bytes)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0 in memory on bignode3:38948 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0 in memory on bignode3:57341 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0 in memory on bignode1:33229 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0 in memory on bignode1:42261 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0 in memory on bignode2:44033 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0 in memory on bignode2:42863 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO storage.BlockManagerInfo: Added broadcast_132_piece0 in memory on bignode4:58639 (size: 7.4 KB, free: 1766.4 MB)
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode3:46462
15/07/31 15:19:46 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 3 is 71847 bytes
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode3:38803
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode1:35241
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode1:48323
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode2:56697
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode4:55810
15/07/31 15:19:46 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode2:37386
New Feature Request
Dear Spark Dev Community, I am wondering if there is already a function to solve my problem. If not, should I work on this? Say you just want to check whether a word exists in a huge text file. I could not find better ways than those mentioned here: http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6. So I am proposing that we add a function called exists to RDD with the following signature:

# returns true if at least n elements exist which qualify our criteria
# the qualifying function receives the element and its index and returns true or false
def exists(qualifying_function, n):

Regards, Sandeep Giri, +1 347 781 4573 (US) +91-953-899-8962 (IN) www.KnowBigData.com. Phone: +1-253-397-1945 (Office)
https://linkedin.com/company/knowbigdata | http://knowbigdata.com | https://facebook.com/knowbigdata | https://twitter.com/IKnowBigData
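For the sake of discussion, a rough Scala sketch of what such a helper could look like today, written outside the RDD class; the name, the index-passing behavior, and the default n = 1 follow the proposal above and are not an existing Spark API:

import org.apache.spark.rdd.RDD

// Hypothetical helper: true if at least `n` elements satisfy the predicate,
// which also receives each element's index as proposed above.
def exists[T](rdd: RDD[T])(qualifies: (T, Long) => Boolean, n: Int = 1): Boolean = {
  rdd.zipWithIndex()
    .filter { case (elem, idx) => qualifies(elem, idx) }
    .take(n)          // take() scans partitions incrementally, so it can stop early
    .length >= n
}

For the word-in-a-huge-text-file case, the predicate would simply test equality with the word and ignore the index.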
Re: New Feature Request
Hi, the RDD class does not have an exists() method (in the Scala API), but the functionality you need seems easy to assemble from the existing methods:

val containsNMatchingElements = data.filter(qualifying_function).take(n).length >= n

Note: I am not sure whether the intermediate take(n) really increases performance, but the idea is to arbitrarily reduce the number of elements in the RDD before counting, because we are not interested in the full count. If you need to check specifically whether there is at least one matching occurrence, it is probably preferable to use isEmpty() and check whether the result is false:

val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())

Best, Carsten

On 31.07.2015 at 11:11, Sandeep Giri wrote:
...

--
Carsten Schnober, Doctoral Researcher, Ubiquitous Knowledge Processing (UKP) Lab, FB 20 / Computer Science Department, Technische Universität Darmstadt, Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de | www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de | GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources (AIPHES): www.aiphes.tu-darmstadt.de | PhD program: Knowledge Discovery in Scientific Literature (KDSL): www.kdsl.tu-darmstadt.de
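Applied to the original word-in-a-huge-text-file question, that approach might look like the following sketch; the file path and the word are placeholders:

val words = sc.textFile("hdfs:///path/to/huge.txt").flatMap(_.split("\\s+"))

// "Does the word occur at least once?" -- isEmpty() only has to find one element.
val wordExists = !words.filter(_ == "spark").isEmpty()

// "Do at least n matches exist?" -- take(n) stops scanning once it has n elements.
val n = 5
val atLeastN = words.filter(_ == "spark").take(n).length >= n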
Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature
Another error:

15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode1:40443
15/07/31 16:15:28 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 3 is 583 bytes
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode1:40474
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode2:34052
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode3:46929
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode3:50890
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode2:47795
15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode4:35120
15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 32.0 in stage 151.0 (TID 1203) in 155 ms on bignode3 (1/50)
15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 35.0 in stage 151.0 (TID 1204) in 157 ms on bignode2 (2/50)
15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 151.0 (TID 1196) in 168 ms on bignode3 (3/50)
15/07/31 16:15:28 WARN scheduler.TaskSetManager: Lost task 46.0 in stage 151.0 (TID 1184, bignode1): java.lang.NegativeArraySizeException
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:314)
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getUTF8String(UnsafeRow.java:297)
    at SC$SpecificProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:152)
    at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:140)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:148)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature
Is this deterministically reproducible? Can you try this on the latest master branch? Would be great to turn on debug logging and dump the generated code. Also would be great to dump the array size at your line 314 in UnsafeRow (and whatever the corresponding line is on master).

On Fri, Jul 31, 2015 at 1:31 AM, james yiaz...@gmail.com wrote:
Another error:
15/07/31 16:15:28 WARN scheduler.TaskSetManager: Lost task 46.0 in stage 151.0 (TID 1184, bignode1): java.lang.NegativeArraySizeException
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:314)
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getUTF8String(UnsafeRow.java:297)
...
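One way to get the generated code into the logs, sketched under the assumption that the deployment uses Spark's default log4j setup and that the codegen classes emit the generated source at DEBUG level (adjust the logger name if it differs in your build):

import org.apache.log4j.{Level, Logger}

// Raise only the codegen package to DEBUG so the generated Java source is dumped,
// without making the whole application log DEBUG-verbose.
Logger.getLogger("org.apache.spark.sql.catalyst.expressions.codegen").setLevel(Level.DEBUG)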
Re: New Feature Request
Hello! You could try something like this:

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
  rdd.filter(f).countApprox(timeout = 1).getFinalValue().low >= n
}

It would work for large datasets and large values of n. Have a nice day, Jonathan

On 31 July 2015 at 11:29, Carsten Schnober schno...@ukp.informatik.tu-darmstadt.de wrote:
Hi, the RDD class does not have an exists() method (in the Scala API), but the functionality you need seems easy to assemble from the existing methods...
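A quick usage sketch of that helper with a hypothetical words RDD; taking the low bound of the returned BoundedDouble keeps the check conservative:

val words = sc.textFile("hdfs:///path/to/huge.txt").flatMap(_.split("\\s+"))

// True only if the (lower bound of the) count of matches reaches n.
val atLeastTen = exists(words)(_ == "spark", n = 10)

One caveat, if I am reading the PartialResult API correctly: getFinalValue() blocks until the exact count is available, so honoring the 1 ms timeout would instead mean looking at initialValue once the timeout has elapsed.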
Re: Should spark-ec2 get its own repo?
PS: is this still in progress? It feels like something that would be good to do before 1.5.0, if it's going to happen soon.

On Wed, Jul 22, 2015 at 6:59 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
Yeah, I'll send a note to the mesos dev list just to make sure they are informed. Shivaram
...