Re: Spark Scheduler creating Straggler Node
I don't just want to replicate all cached blocks. I am trying to find a way to solve the issue I mentioned in the mail above. Having replicas for all cached blocks would add more cost for customers. On Wed, Mar 9, 2016 at 9:50 AM, Reynold Xin wrote: > You just want to be able to replicate hot cached blocks, right? > > > On Tuesday, March 8, 2016, Prabhu Joseph > wrote: > >> Hi All, >> >> When a Spark job is running, one of the Spark executors on Node A >> has some partitions cached. Later, for some other stage, the scheduler tries to >> assign a task to Node A to process a cached partition (PROCESS_LOCAL). But >> meanwhile Node A is occupied with some other >> tasks and gets busy. The scheduler waits for the spark.locality.wait interval, >> times out, and tries to find some other node B which is NODE_LOCAL. The >> executor on Node B will try to get the cached partition from Node A, which >> adds network IO to the node and also some extra CPU for I/O. Eventually, >> every node has a task that is waiting to fetch some cached >> partition from Node A, and so the Spark job / cluster is basically blocked >> on a single node. >> >> A Spark JIRA has been created: https://issues.apache.org/jira/browse/SPARK-13718 >> >> Beginning with Spark 1.2, Spark introduced the External Shuffle Service to >> let executors fetch shuffle files from an external service instead of >> from each other, offloading that work from the Spark executors. >> >> We want to check whether a similar external service is >> implemented for transferring cached partitions to other executors. >> >> >> Thanks, Prabhu Joseph >> >> >>
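The locality fallback described in the thread can be sketched as a tiny model. This is a pure-Python illustration of delay scheduling, not Spark's actual scheduler code; the function and names are made up:

```python
# Toy model of delay scheduling: a task prefers the node that caches its
# partition (PROCESS_LOCAL); after the locality wait expires it falls back
# to another node (NODE_LOCAL), which must then fetch the cached block
# from the busy node over the network.

def schedule(free_nodes, cache_node, locality_wait, waited):
    """Return (chosen_node_or_None, needs_remote_fetch)."""
    if cache_node in free_nodes:
        return cache_node, False   # PROCESS_LOCAL: no network IO
    if waited < locality_wait:
        return None, False         # keep waiting for the cache node to free up
    # Locality wait expired: run anywhere, paying a remote fetch from cache_node
    return free_nodes[0], True
```

Once every free node's task lands in the `needs_remote_fetch` branch against the same cache node, that node becomes the straggler the mail describes.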
Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.
Isn't this just specified by the user? On Tue, Mar 8, 2016 at 9:49 PM, Hyukjin Kwon wrote: > Hi all, > > Currently, the output from the CSV, TEXT and JSON data sources does not have > file extensions such as .csv, .txt and .json (except for compression > extensions such as .gz, .deflate and .bz2). > > In addition, it looks like Parquet has extensions such as .gz.parquet or > .snappy.parquet according to the compression codec, whereas ORC does not have > such extensions; it is just .orc. > > I tried to search for related JIRAs but could not find any. I did not open a JIRA directly > because I suspect this has already been discussed. > > Should I open a JIRA for these inconsistent file extensions? > > I would be thankful for any feedback. > > Thanks! >
Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.
Hi all, Currently, the output from the CSV, TEXT and JSON data sources does not have file extensions such as .csv, .txt and .json (except for compression extensions such as .gz, .deflate and .bz2). In addition, it looks like Parquet has extensions such as .gz.parquet or .snappy.parquet according to the compression codec, whereas ORC does not have such extensions; it is just .orc. I tried to search for related JIRAs but could not find any. I did not open a JIRA directly because I suspect this has already been discussed. Should I open a JIRA for these inconsistent file extensions? I would be thankful for any feedback. Thanks!
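One way to picture the convention under discussion is a small naming helper. This is a hypothetical sketch, not Spark's actual output-naming code; it simply generalizes Parquet's ".snappy.parquet" / ".gz.parquet" style to the other formats:

```python
# Illustrative mappings only -- not taken from Spark's source.
CODEC_EXT = {None: "", "gzip": ".gz", "bzip2": ".bz2",
             "deflate": ".deflate", "snappy": ".snappy"}
FORMAT_EXT = {"csv": ".csv", "text": ".txt", "json": ".json",
              "parquet": ".parquet", "orc": ".orc"}

def output_suffix(fmt, codec=None):
    # Parquet-style ordering: codec extension first, then format extension,
    # so gzip-compressed CSV output would end in ".gz.csv".
    return CODEC_EXT[codec] + FORMAT_EXT[fmt]
```

Under this convention, uncompressed ORC stays `.orc` while compressed output carries both extensions, which is exactly the consistency being asked about.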
Re: Spark Scheduler creating Straggler Node
You just want to be able to replicate hot cached blocks, right? On Tuesday, March 8, 2016, Prabhu Joseph wrote: > Hi All, > > When a Spark job is running, one of the Spark executors on Node A > has some partitions cached. Later, for some other stage, the scheduler tries to > assign a task to Node A to process a cached partition (PROCESS_LOCAL). But > meanwhile Node A is occupied with some other > tasks and gets busy. The scheduler waits for the spark.locality.wait interval, > times out, and tries to find some other node B which is NODE_LOCAL. The > executor on Node B will try to get the cached partition from Node A, which > adds network IO to the node and also some extra CPU for I/O. Eventually, > every node has a task that is waiting to fetch some cached partition > from Node A, and so the Spark job / cluster is basically blocked on a single > node. > > A Spark JIRA has been created: https://issues.apache.org/jira/browse/SPARK-13718 > > Beginning with Spark 1.2, Spark introduced the External Shuffle Service to > let executors fetch shuffle files from an external service instead of > from each other, offloading that work from the Spark executors. > > We want to check whether a similar external service is > implemented for transferring cached partitions to other executors. > > > Thanks, Prabhu Joseph > > >
Re: [VOTE] Release Apache Spark 1.6.1 (RC1)
+1 On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or wrote: > +1 > > 2016-03-08 10:59 GMT-08:00 Yin Huai : > >> +1 >> >> On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin wrote: >> >>> +1 (binding) >>> >>> >>> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov >>> wrote: +1 Spark ODBC server is fine, SQL is fine. 2016-03-03 12:09 GMT-08:00 Yin Yang : > Skipping docker tests, the rest are green: > > [INFO] Spark Project External Kafka ... SUCCESS > [01:28 min] > [INFO] Spark Project Examples . SUCCESS > [02:59 min] > [INFO] Spark Project External Kafka Assembly .. SUCCESS [ > 11.680 s] > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 02:16 h > [INFO] Finished at: 2016-03-03T11:17:07-08:00 > [INFO] Final Memory: 152M/4062M > > On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang wrote: > >> When I ran the test suite using the following command: >> >> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 >> -Dhadoop.version=2.7.0 package >> >> I got a failure in Spark Project Docker Integration Tests: >> >> 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator: >> Remote daemon shut down; proceeding with flushing remote transports.
>> *** RUN ABORTED *** >> com.spotify.docker.client.DockerException: >> java.util.concurrent.ExecutionException: >> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: >> java.io.IOException: No such file or directory >> at >> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141) >> at >> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082) >> at >> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281) >> at >> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76) >> at >> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) >> at >> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58) >> at >> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) >> at >> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58) >> at >> org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492) >> at >> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528) >> ... >> Cause: java.util.concurrent.ExecutionException: >> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: >> java.io.IOException: No such file or directory >> at >> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) >> at >> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) >> at >> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) >> at >> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080) >> at >> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281) >> at >> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76) >> at >> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) >> at >> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58) >> at >> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) >> at >> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58) >> ... >> Cause: >> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: >> java.io.IOException: No such file or directory >> at >> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481) >> at >> org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491) >> at >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
Re: [VOTE] Release Apache Spark 1.6.1 (RC1)
+1 2016-03-08 10:59 GMT-08:00 Yin Huai: > +1 > > On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin wrote: > >> +1 (binding) >> >> >> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov >> wrote: >> >>> +1 >>> >>> Spark ODBC server is fine, SQL is fine. >>> >>> 2016-03-03 12:09 GMT-08:00 Yin Yang : >>> Skipping docker tests, the rest are green: [INFO] Spark Project External Kafka ... SUCCESS [01:28 min] [INFO] Spark Project Examples . SUCCESS [02:59 min] [INFO] Spark Project External Kafka Assembly .. SUCCESS [ 11.680 s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 02:16 h [INFO] Finished at: 2016-03-03T11:17:07-08:00 [INFO] Final Memory: 152M/4062M On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang wrote: > When I ran test suite using the following command: > > build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 > -Dhadoop.version=2.7.0 package > > I got failure in Spark Project Docker Integration Tests : > > 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator: > Remote daemon shut down; proceeding with flushing remote transports. 
Re: [VOTE] Release Apache Spark 1.6.1 (RC1)
+1 On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin wrote: > +1 (binding) > > > On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov > wrote: > >> +1 >> >> Spark ODBC server is fine, SQL is fine. >> >> 2016-03-03 12:09 GMT-08:00 Yin Yang : >> >>> Skipping docker tests, the rest are green: >>> >>> [INFO] Spark Project External Kafka ... SUCCESS >>> [01:28 min] >>> [INFO] Spark Project Examples . SUCCESS >>> [02:59 min] >>> [INFO] Spark Project External Kafka Assembly .. SUCCESS [ >>> 11.680 s] >>> [INFO] >>> >>> [INFO] BUILD SUCCESS >>> [INFO] >>> >>> [INFO] Total time: 02:16 h >>> [INFO] Finished at: 2016-03-03T11:17:07-08:00 >>> [INFO] Final Memory: 152M/4062M >>> >>> On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang wrote: >>> When I ran the test suite using the following command: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.0 package I got a failure in Spark Project Docker Integration Tests: 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
Re: Spark structured streaming
This is in active development, so there is not much that can be done from an end-user perspective. In particular, the only sink available in apache/master is a testing sink that just stores the data in memory. We are working on a Parquet-based file sink and will eventually support all of the Data Source API file formats (text, json, csv, orc, parquet). On Tue, Mar 8, 2016 at 7:38 AM, Jacek Laskowski wrote: > Hi Praveen, > > I don't really know. I think TD or Michael should know, as they are > personally involved in the task (as far as I could figure it out from > the JIRA and the changes). Ping people on the JIRA so they notice your > question(s). > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Tue, Mar 8, 2016 at 12:32 PM, Praveen Devarao > wrote: > > Thanks Jacek for the pointer. > > > > Any idea which package can be used in .format()? The test cases seem to > work > > out of the DefaultSource class defined within the > DataFrameReaderWriterSuite > > [org.apache.spark.sql.streaming.test.DefaultSource] > > > > Thanking You > > > - > > Praveen Devarao > > Spark Technology Centre > > IBM India Software Labs > > > - > > "Courage doesn't always roar. Sometimes courage is the quiet voice at the > > end of the day saying I will try again" > > > > > > > > From: Jacek Laskowski > > To: Praveen Devarao/India/IBM@IBMIN > > Cc: user , dev > > Date: 08/03/2016 04:17 pm > > Subject: Re: Spark structured streaming > > > > > > > > > > Hi Praveen, > > > > I've spent a few hours on the changes related to streaming dataframes > > (included in SPARK-8360) and concluded that it's currently only > > possible to read.stream(), but not write.stream(), since there are no > > streaming Sinks yet.
> > > > Pozdrawiam, > > Jacek Laskowski > > > > https://medium.com/@jaceklaskowski/ > > Mastering Apache Spark http://bit.ly/mastering-apache-spark > > Follow me at https://twitter.com/jaceklaskowski > > > > > > On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao > > wrote: > >> Hi, > >> > >> I would like to get my hands on the structured streaming feature > >> coming out in Spark 2.0. I have tried looking around for code samples to > >> get > >> started but am not able to find any. The only things I could look into are > >> the test cases that have been committed under the JIRA umbrella > >> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't > >> lead to building an example, as they seem to work out of internal > >> classes. > >> > >> Could anyone point me to some resources or pointers in code > that I > >> can start with to understand structured streaming from a consumability > >> angle? > >> > >> Thanking You > >> > >> > - > >> Praveen Devarao > >> Spark Technology Centre > >> IBM India Software Labs > >> > >> > - > >> "Courage doesn't always roar. Sometimes courage is the quiet voice at > the > >> end of the day saying I will try again" > > > > - > > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > > > > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
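The testing sink mentioned in the reply (the only sink in apache/master at the time) can be modeled roughly like this. This is a pure-Python sketch of the idea, not Spark's actual Sink API; the class and method names are made up:

```python
# A memory sink accumulates each triggered batch in process memory, which
# is only useful for tests. Tracking committed batch ids keeps the sink
# idempotent if a trigger is retried.

class MemorySink:
    def __init__(self):
        self.batches = []

    def add_batch(self, batch_id, rows):
        if batch_id < len(self.batches):
            return  # batch already committed; ignore the retry
        self.batches.append(list(rows))

    def all_rows(self):
        # Flatten all committed batches, in trigger order
        return [row for batch in self.batches for row in batch]
```

A file-based sink (such as the Parquet sink under development) would follow the same contract, but write each batch to durable storage instead of a list.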
Re: Spark structured streaming
Hi Praveen, I don't really know. I think TD or Michael should know, as they are personally involved in the task (as far as I could figure it out from the JIRA and the changes). Ping people on the JIRA so they notice your question(s). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Mar 8, 2016 at 12:32 PM, Praveen Devarao wrote: > Thanks Jacek for the pointer. > > Any idea which package can be used in .format()? The test cases seem to work > out of the DefaultSource class defined within the DataFrameReaderWriterSuite > [org.apache.spark.sql.streaming.test.DefaultSource] > > Thanking You > - > Praveen Devarao > Spark Technology Centre > IBM India Software Labs > - > "Courage doesn't always roar. Sometimes courage is the quiet voice at the > end of the day saying I will try again" > > > > From: Jacek Laskowski > To: Praveen Devarao/India/IBM@IBMIN > Cc: user , dev > Date: 08/03/2016 04:17 pm > Subject: Re: Spark structured streaming > > > > > Hi Praveen, > > I've spent a few hours on the changes related to streaming dataframes > (included in SPARK-8360) and concluded that it's currently only > possible to read.stream(), but not write.stream(), since there are no > streaming Sinks yet. > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao > wrote: >> Hi, >> >> I would like to get my hands on the structured streaming feature >> coming out in Spark 2.0. I have tried looking around for code samples to >> get >> started but am not able to find any.
The only things I could look into are >> the test cases that have been committed under the JIRA umbrella >> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't >> lead to building an example, as they seem to work out of internal >> classes. >> >> Could anyone point me to some resources or pointers in code that I >> can start with to understand structured streaming from a consumability >> angle? >> >> Thanking You >> >> - >> Praveen Devarao >> Spark Technology Centre >> IBM India Software Labs >> >> - >> "Courage doesn't always roar. Sometimes courage is the quiet voice at the >> end of the day saying I will try again" > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > > > - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Use cases for kafka direct stream messageHandler
No, it looks like you'd have to catch them in the serializer and have the serializer return an Option or something. The new consumer builds a buffer full of records, not one at a time. On Mar 8, 2016 4:43 AM, "Marius Soutier" wrote: > > > On 04.03.2016, at 22:39, Cody Koeninger wrote: > > > > The only other valid use of messageHandler that I can think of is > > catching serialization problems on a per-message basis. But with the > > new Kafka consumer library, that doesn't seem feasible anyway, and > > could be handled with a custom (de)serializer. > > What do you mean, that doesn't seem feasible? You mean when using a custom > deserializer? Right now I'm catching serialization problems in the message > handler; after your proposed change I'd catch them in `map()`. > >
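Catching serialization problems inside the deserializer, as suggested above, might look like this. This is a pure-Python sketch using `None` where Scala would return `Option`; it is an illustration of the pattern, not the actual Kafka consumer API:

```python
import json

def safe_deserialize(raw_bytes):
    # Return the decoded record, or None if the record is malformed,
    # instead of letting one bad message abort the whole batch.
    try:
        return json.loads(raw_bytes.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None

def decode_batch(batch):
    # The new consumer hands back a whole buffer of records at once, so
    # bad records are filtered here rather than handled per message.
    return [r for r in (safe_deserialize(b) for b in batch) if r is not None]
```

Downstream code then never sees the malformed records, which replaces the per-message `messageHandler` hook discussed in the thread.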
Re: Spark structured streaming
Thanks Jacek for the pointer. Any idea which package can be used in .format()? The test cases seem to work out of the DefaultSource class defined within the DataFrameReaderWriterSuite [org.apache.spark.sql.streaming.test.DefaultSource] Thanking You - Praveen Devarao Spark Technology Centre IBM India Software Labs - "Courage doesn't always roar. Sometimes courage is the quiet voice at the end of the day saying I will try again" From: Jacek Laskowski To: Praveen Devarao/India/IBM@IBMIN Cc: user , dev Date: 08/03/2016 04:17 pm Subject: Re: Spark structured streaming Hi Praveen, I've spent a few hours on the changes related to streaming dataframes (included in SPARK-8360) and concluded that it's currently only possible to read.stream(), but not write.stream(), since there are no streaming Sinks yet. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao wrote: > Hi, > > I would like to get my hands on the structured streaming feature > coming out in Spark 2.0. I have tried looking around for code samples to get > started but am not able to find any. The only things I could look into are > the test cases that have been committed under the JIRA umbrella > https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't > lead to building an example, as they seem to work out of internal > classes. > > Could anyone point me to some resources or pointers in code that I > can start with to understand structured streaming from a consumability > angle? > > Thanking You > - > Praveen Devarao > Spark Technology Centre > IBM India Software Labs > - > "Courage doesn't always roar. Sometimes courage is the quiet voice at the > end of the day saying I will try again" - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark structured streaming
Hi Praveen, I've spent a few hours on the changes related to streaming dataframes (included in SPARK-8360) and concluded that it's currently only possible to read.stream(), but not write.stream(), since there are no streaming Sinks yet. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao wrote: > Hi, > > I would like to get my hands on the structured streaming feature > coming out in Spark 2.0. I have tried looking around for code samples to get > started but am not able to find any. The only things I could look into are > the test cases that have been committed under the JIRA umbrella > https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't > lead to building an example, as they seem to work out of internal > classes. > > Could anyone point me to some resources or pointers in code that I > can start with to understand structured streaming from a consumability > angle? > > Thanking You > - > Praveen Devarao > Spark Technology Centre > IBM India Software Labs > - > "Courage doesn't always roar. Sometimes courage is the quiet voice at the > end of the day saying I will try again" - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Spark structured streaming
Hi, I would like to get my hands on the structured streaming feature coming out in Spark 2.0. I have tried looking around for code samples to get started but am not able to find any. The only things I could look into are the test cases that have been committed under the JIRA umbrella https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't lead to building an example, as they seem to work out of internal classes. Could anyone point me to some resources or pointers in code that I can start with to understand structured streaming from a consumability angle? Thanking You - Praveen Devarao Spark Technology Centre IBM India Software Labs - "Courage doesn't always roar. Sometimes courage is the quiet voice at the end of the day saying I will try again"
Re: ML ALS API
Hi Maciej, Yes, that *train* method is intended to be public, but it is marked as *DeveloperApi*, which means that backward compatibility is not necessarily guaranteed and the method may change. Having said that, even APIs marked as DeveloperApi tend to be relatively stable. As the comment mentions: * :: DeveloperApi :: * An implementation of ALS that supports *generic ID types*, specialized for Int and Long. This is * exposed as a developer API for users who do need other ID types. But it is not recommended * because it increases the shuffle size and memory requirement during training. This *train* method is intended for the use case where user and item ids are not the default Int (e.g. String). As you can see, it returns the factor RDDs directly, as opposed to an ALSModel instance, so overall it is a little less user-friendly. The *Float* ratings are to save space and make ALS more efficient overall. That will not change in 2.0+ (especially since the precision of ratings is not very important). Hope that helps. On Tue, 8 Mar 2016 at 08:20 Maciej Szymkiewicz wrote: > Can I ask for clarifications regarding ml.recommendation.ALS: > > - is the train method > ( > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L598 > ) > intended to be public? > - the Rating class > ( > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L436 ) is > using float instead of double like its MLlib counterpart. Is it going to > be the default encoding in 2.0+? > > -- > Best, > Maciej Szymkiewicz > > >
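A back-of-the-envelope look at why Float ratings halve the storage and shuffle cost of the rating column; the rating count below is hypothetical, chosen only to make the arithmetic concrete:

```python
import struct

# Per-value sizes: a single-precision float is 4 bytes, a double is 8.
FLOAT_SIZE = struct.calcsize("f")
DOUBLE_SIZE = struct.calcsize("d")

ratings = 200_000_000  # hypothetical number of (user, item, rating) triples
# Bytes saved on the rating column alone by storing Float instead of Double
saved = ratings * (DOUBLE_SIZE - FLOAT_SIZE)
```

At this scale the rating column alone shrinks by about 800 MB, which is material for shuffle size and memory during training, while rating precision beyond single-precision rarely matters.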
Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml
Hi, I updated PR https://github.com/apache/spark/pull/11567. But `lint-java` fails if that file is in the dev folder. (Jenkins fails, too.) So, inevitably, I changed pom.xml instead. Dongjoon. On Mon, Mar 7, 2016 at 11:40 PM, Jacek Laskowski wrote: > Hi, > > At first glance it appears the commit *yesterday* (Warsaw time) broke > the build :( > > > https://github.com/apache/spark/commit/0eea12a3d956b54bbbd73d21b296868852a04494 > > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Tue, Mar 8, 2016 at 8:38 AM, Jacek Laskowski wrote: > > Hi, > > > > Got the BUILD FAILURE. Anyone looking into it? > > > > ➜ spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6 > > -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean > > install > > ... > > [INFO] > > > [INFO] BUILD FAILURE > > [INFO] > > > [INFO] Total time: 2.837 s > > [INFO] Finished at: 2016-03-08T08:19:36+01:00 > > [INFO] Final Memory: 50M/581M > > [INFO] > > > [ERROR] Failed to execute goal > > org.scalastyle:scalastyle-maven-plugin:0.8.0:check (default) on > > project spark-parent_2.11: Failed during scalastyle execution: Unable > > to find configuration file at location dev/scalastyle-config.xml -> > > [Help 1] > > [ERROR] > > [ERROR] To see the full stack trace of the errors, re-run Maven with > > the -e switch. > > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> > [ERROR] > > [ERROR] For more information about the errors and possible solutions, > > please read the following articles: > > [ERROR] [Help 1] > > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException > > > > Pozdrawiam, > > Jacek Laskowski > > > > https://medium.com/@jaceklaskowski/ > > Mastering Apache Spark http://bit.ly/mastering-apache-spark > > Follow me at https://twitter.com/jaceklaskowski > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml
Hi, At first glance it appears the commit *yesterday* (Warsaw time) broke the build :( https://github.com/apache/spark/commit/0eea12a3d956b54bbbd73d21b296868852a04494 Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Mar 8, 2016 at 8:38 AM, Jacek Laskowski wrote: > Hi, > > Got the BUILD FAILURE. Anyone looking into it? > > ➜ spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6 > -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean > install > ... > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 2.837 s > [INFO] Finished at: 2016-03-08T08:19:36+01:00 > [INFO] Final Memory: 50M/581M > [INFO] > > [ERROR] Failed to execute goal > org.scalastyle:scalastyle-maven-plugin:0.8.0:check (default) on > project spark-parent_2.11: Failed during scalastyle execution: Unable > to find configuration file at location dev/scalastyle-config.xml -> > [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with > the -e switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, > please read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org