Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
I don't just want to replicate all cached blocks. I am trying to find a way
to solve the issue which I mentioned in the mail above. Having replicas for all
cached blocks would add more cost for customers.



On Wed, Mar 9, 2016 at 9:50 AM, Reynold Xin  wrote:

> You just want to be able to replicate hot cached blocks right?
>
>
> On Tuesday, March 8, 2016, Prabhu Joseph 
> wrote:
>
>> Hi All,
>>
>> When a Spark job is running, one of the Spark executors on Node A
>> has some partitions cached. Later, for some other stage, the scheduler tries
>> to assign a task to Node A to process a cached partition (PROCESS_LOCAL). But
>> meanwhile Node A is occupied with other tasks and gets busy. The scheduler
>> waits for the spark.locality.wait interval, times out, and tries to find some
>> other node B which is NODE_LOCAL. The executor on Node B will try to fetch
>> the cached partition from Node A, which adds network I/O to the node and also
>> some extra CPU for I/O. Eventually, every node has a task that is waiting to
>> fetch some cached partition from Node A, and so the Spark job / cluster is
>> basically blocked on a single node.
>>
>> Spark JIRA is created https://issues.apache.org/jira/browse/SPARK-13718
>>
>> Beginning with Spark 1.2, Spark introduced the External Shuffle Service to
>> let executors fetch shuffle files from an external service instead of from
>> each other, which offloads that work from the Spark executors.
>>
>> We want to check whether a similar kind of external service exists for
>> transferring cached partitions to other executors.
>>
>>
>> Thanks, Prabhu Joseph
>>
>>
>>
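For readers following this thread, here is a minimal sketch (Scala) of the two
existing knobs touched on above -- lowering spark.locality.wait so tasks fall
back to other nodes sooner, and replicating only a known-hot RDD with
MEMORY_ONLY_2. This is a workaround sketch under assumed settings and a
placeholder input path, not the external cache-serving service proposed in
SPARK-13718.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Fall back from PROCESS_LOCAL to other nodes sooner, trading locality for
// less queuing behind the single busy node.
val conf = new SparkConf()
  .setAppName("locality-sketch")
  .set("spark.locality.wait", "1s")
val sc = new SparkContext(conf)

// Replicate the cached partitions of a known-hot RDD so a second node can
// serve them locally. Replicating *everything* is the cost concern raised
// above. The HDFS path is a placeholder.
val hotRdd = sc.textFile("hdfs:///data/hot-input")
  .persist(StorageLevel.MEMORY_ONLY_2)
hotRdd.count()  // materialize the replicated cache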


Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-08 Thread Reynold Xin
Isn't this just specified by the user?


On Tue, Mar 8, 2016 at 9:49 PM, Hyukjin Kwon  wrote:

> Hi all,
>
> Currently, the output from CSV, TEXT and JSON data sources does not have
> file extensions such as .csv, .txt and .json (except for compression
> extensions such as .gz, .deflate and .bz2).
>
> In addition, it looks like Parquet has extensions such as .gz.parquet or
> .snappy.parquet according to the compression codec, whereas ORC does not
> have such extensions and is just .orc.
>
> I tried to search for JIRAs related to this but could not find any yet. I
> did not open a JIRA directly because I feel this may already have been
> discussed.
>
> May I open a JIRA for these inconsistent file extensions?
>
> I would be thankful if you could give me some feedback.
>
> Thanks!
>


Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-08 Thread Hyukjin Kwon
Hi all,

Currently, the output from CSV, TEXT and JSON data sources does not have
file extensions such as .csv, .txt and .json (except for compression
extensions such as .gz, .deflate and .bz2).

In addition, it looks like Parquet has extensions such as .gz.parquet or
.snappy.parquet according to the compression codec, whereas ORC does not
have such extensions and is just .orc.

I tried to search for JIRAs related to this but could not find any yet. I
did not open a JIRA directly because I feel this may already have been
discussed.

May I open a JIRA for these inconsistent file extensions?

I would be thankful if you could give me some feedback.

Thanks!
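A quick way to see the behaviour described above is to write a small DataFrame
with a couple of sources and list the resulting part files. A hedged repro
sketch: the output file names depend on the Spark version and the configured
codec, and the /tmp paths and local master below are just placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ext-check").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
val df = sqlContext.range(0, 10).toDF("id")

df.write.json("/tmp/out-json")        // part-* files, historically without a .json suffix
df.write.parquet("/tmp/out-parquet")  // part-* files ending in .gz.parquet / .snappy.parquet

// List what was actually written:
new java.io.File("/tmp/out-json").listFiles().foreach(f => println(f.getName))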


Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Reynold Xin
You just want to be able to replicate hot cached blocks right?

On Tuesday, March 8, 2016, Prabhu Joseph  wrote:

> Hi All,
>
> When a Spark job is running, one of the Spark executors on Node A
> has some partitions cached. Later, for some other stage, the scheduler tries
> to assign a task to Node A to process a cached partition (PROCESS_LOCAL). But
> meanwhile Node A is occupied with other tasks and gets busy. The scheduler
> waits for the spark.locality.wait interval, times out, and tries to find some
> other node B which is NODE_LOCAL. The executor on Node B will try to fetch
> the cached partition from Node A, which adds network I/O to the node and also
> some extra CPU for I/O. Eventually, every node has a task that is waiting to
> fetch some cached partition from Node A, and so the Spark job / cluster is
> basically blocked on a single node.
>
> Spark JIRA is created https://issues.apache.org/jira/browse/SPARK-13718
>
> Beginning with Spark 1.2, Spark introduced the External Shuffle Service to
> let executors fetch shuffle files from an external service instead of from
> each other, which offloads that work from the Spark executors.
>
> We want to check whether a similar kind of external service exists for
> transferring cached partitions to other executors.
>
>
> Thanks, Prabhu Joseph
>
>
>


Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-08 Thread Burak Yavuz
+1

On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or  wrote:

> +1
>
> 2016-03-08 10:59 GMT-08:00 Yin Huai :
>
>> +1
>>
>> On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin  wrote:
>>
>>> +1 (binding)
>>>
>>>
>>> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov 
>>> wrote:
>>>
 +1

 Spark ODBC server is fine, SQL is fine.

 2016-03-03 12:09 GMT-08:00 Yin Yang :

> Skipping docker tests, the rest are green:
>
> [INFO] Spark Project External Kafka ... SUCCESS
> [01:28 min]
> [INFO] Spark Project Examples . SUCCESS
> [02:59 min]
> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
> 11.680 s]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 02:16 h
> [INFO] Finished at: 2016-03-03T11:17:07-08:00
> [INFO] Final Memory: 152M/4062M
>
> On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang  wrote:
>
>> When I ran test suite using the following command:
>>
>> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
>> -Dhadoop.version=2.7.0 package
>>
>> I got failure in Spark Project Docker Integration Tests :
>>
>> 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator:
>> Remote daemon shut down; proceeding with flushing remote transports.
>> *** RUN ABORTED ***
>>   com.spotify.docker.client.DockerException: java.util.concurrent.ExecutionException:
>>   com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>>   java.io.IOException: No such file or directory
>>   at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
>>   at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
>>   at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>>   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>>   ...
>>   Cause: java.util.concurrent.ExecutionException:
>>   com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>>   java.io.IOException: No such file or directory
>>   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>>   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>>   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>>   at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
>>   at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>>   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>>   ...
>>   Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>>   java.io.IOException: No such file or directory
>>   at org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
>>   at org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-08 Thread Andrew Or
+1

2016-03-08 10:59 GMT-08:00 Yin Huai :

> +1
>
> On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin  wrote:
>
>> +1 (binding)
>>
>>
>> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov 
>> wrote:
>>
>>> +1
>>>
>>> Spark ODBC server is fine, SQL is fine.
>>>
>>> 2016-03-03 12:09 GMT-08:00 Yin Yang :
>>>
 Skipping docker tests, the rest are green:

 [INFO] Spark Project External Kafka ... SUCCESS
 [01:28 min]
 [INFO] Spark Project Examples . SUCCESS
 [02:59 min]
 [INFO] Spark Project External Kafka Assembly .. SUCCESS [
 11.680 s]
 [INFO]
 
 [INFO] BUILD SUCCESS
 [INFO]
 
 [INFO] Total time: 02:16 h
 [INFO] Finished at: 2016-03-03T11:17:07-08:00
 [INFO] Final Memory: 152M/4062M

 On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang  wrote:

> When I ran test suite using the following command:
>
> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.0 package
>
> I got failure in Spark Project Docker Integration Tests :
>
> 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator:
> Remote daemon shut down; proceeding with flushing remote transports.
> *** RUN ABORTED ***
>   com.spotify.docker.client.DockerException: java.util.concurrent.ExecutionException:
>   com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>   java.io.IOException: No such file or directory
>   at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
>   at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
>   at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>   ...
>   Cause: java.util.concurrent.ExecutionException:
>   com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>   java.io.IOException: No such file or directory
>   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
>   at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   ...
>   Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>   java.io.IOException: No such file or directory
>   at org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
>   at org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at
>

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-08 Thread Yin Huai
+1

On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin  wrote:

> +1 (binding)
>
>
> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov 
> wrote:
>
>> +1
>>
>> Spark ODBC server is fine, SQL is fine.
>>
>> 2016-03-03 12:09 GMT-08:00 Yin Yang :
>>
>>> Skipping docker tests, the rest are green:
>>>
>>> [INFO] Spark Project External Kafka ... SUCCESS
>>> [01:28 min]
>>> [INFO] Spark Project Examples . SUCCESS
>>> [02:59 min]
>>> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
>>> 11.680 s]
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 02:16 h
>>> [INFO] Finished at: 2016-03-03T11:17:07-08:00
>>> [INFO] Final Memory: 152M/4062M
>>>
>>> On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang  wrote:
>>>
 When I ran test suite using the following command:

 build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
 -Dhadoop.version=2.7.0 package

 I got failure in Spark Project Docker Integration Tests :

 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator:
 Remote daemon shut down; proceeding with flushing remote transports.
 *** RUN ABORTED ***
   com.spotify.docker.client.DockerException: java.util.concurrent.ExecutionException:
   com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
   java.io.IOException: No such file or directory
   at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
   at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
   at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
   ...
   Cause: java.util.concurrent.ExecutionException:
   com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
   java.io.IOException: No such file or directory
   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
   at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
   at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
   at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
   ...
   Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
   java.io.IOException: No such file or directory
   at org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
   at org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
   at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
   at

Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
This is in active development, so there is not much that can be done from
an end user perspective.  In particular the only sink that is available in
apache/master is a testing sink that just stores the data in memory.  We
are working on a parquet-based file sink and will eventually support all
of the Data Source API file formats (text, json, csv, orc, parquet).
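For anyone experimenting against the current master, here is a rough sketch of
the read-side wiring discussed in this thread. The stream() reader method and
the test source named below come from this thread and its test suite; the API
is unreleased and may change, and the test source is internal, so treat this as
illustrative only.

// Assumes a SQLContext named sqlContext, e.g. in a spark-shell built from master.
val streamingDf = sqlContext.read
  .format("org.apache.spark.sql.streaming.test.DefaultSource")  // test-only source
  .stream()                                                      // read.stream() works today
// write.stream() is not usable yet: the only sink in apache/master is an
// in-memory testing sink, with a parquet-based file sink in progress.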

On Tue, Mar 8, 2016 at 7:38 AM, Jacek Laskowski  wrote:

> Hi Praveen,
>
> I don't really know. I think TD or Michael should know as they
> personally involved in the task (as far as I could figure it out from
> the JIRA and the changes). Ping people on the JIRA so they notice your
> question(s).
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Mar 8, 2016 at 12:32 PM, Praveen Devarao 
> wrote:
> > Thanks Jacek for the pointer.
> >
> > Any idea which package can be used in .format(). The test cases seem to
> work
> > out of the DefaultSource class defined within the
> DataFrameReaderWriterSuite
> > [org.apache.spark.sql.streaming.test.DefaultSource]
> >
> > Thanking You
> >
> -
> > Praveen Devarao
> > Spark Technology Centre
> > IBM India Software Labs
> >
> -
> > "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> > end of the day saying I will try again"
> >
> >
> >
> > From: Jacek Laskowski
> > To: Praveen Devarao/India/IBM@IBMIN
> > Cc: user, dev
> > Date: 08/03/2016 04:17 pm
> > Subject: Re: Spark structured streaming
> >
> >
> >
> >
> > Hi Praveen,
> >
> > I've spent few hours on the changes related to streaming dataframes
> > (included in the SPARK-8360) and concluded that it's currently only
> > possible to read.stream(), but not write.stream() since there are no
> > streaming Sinks yet.
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao 
> > wrote:
> >> Hi,
> >>
> >> I would like to get my hands on the structured streaming feature
> >> coming out in Spark 2.0. I have tried looking around for code samples to
> >> get started but am not able to find any. The only things I could look into
> >> are the test cases that have been committed under the JIRA umbrella
> >> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
> >> lead to building example code, as they seem to be working out of internal
> >> classes.
> >>
>> Could anyone point me to some resources or pointers in code that I
> >> can start with to understand structured streaming from a consumability
> >> angle.
> >>
> >> Thanking You
> >>
> >>
> -
> >> Praveen Devarao
> >> Spark Technology Centre
> >> IBM India Software Labs
> >>
> >>
> -
> >> "Courage doesn't always roar. Sometimes courage is the quiet voice at
> the
> >> end of the day saying I will try again"
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen,

I don't really know. I think TD or Michael should know, as they are
personally involved in the task (as far as I could figure out from
the JIRA and the changes). Ping people on the JIRA so they notice your
question(s).

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Mar 8, 2016 at 12:32 PM, Praveen Devarao  wrote:
> Thanks Jacek for the pointer.
>
> Any idea which package can be used in .format(). The test cases seem to work
> out of the DefaultSource class defined within the DataFrameReaderWriterSuite
> [org.apache.spark.sql.streaming.test.DefaultSource]
>
> Thanking You
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From: Jacek Laskowski
> To: Praveen Devarao/India/IBM@IBMIN
> Cc: user, dev
> Date: 08/03/2016 04:17 pm
> Subject: Re: Spark structured streaming
> 
>
>
>
> Hi Praveen,
>
> I've spent few hours on the changes related to streaming dataframes
> (included in the SPARK-8360) and concluded that it's currently only
> possible to read.stream(), but not write.stream() since there are no
> streaming Sinks yet.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao 
> wrote:
>> Hi,
>>
>> I would like to get my hands on the structured streaming feature
>> coming out in Spark 2.0. I have tried looking around for code samples to
>> get started but am not able to find any. The only things I could look into
>> are the test cases that have been committed under the JIRA umbrella
>> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
>> lead to building example code, as they seem to be working out of internal
>> classes.
>>
>> Could anyone point me to some resources or pointers in code that I
>> can start with to understand structured streaming from a consumability
>> angle.
>>
>> Thanking You
>>
>> -
>> Praveen Devarao
>> Spark Technology Centre
>> IBM India Software Labs
>>
>> -
>> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
>> end of the day saying I will try again"
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Cody Koeninger
No, it looks like you'd have to catch them in the serializer and have the
serializer return an Option or something. The new consumer builds a buffer
full of records, not one at a time.
On Mar 8, 2016 4:43 AM, "Marius Soutier"  wrote:

>
> > On 04.03.2016, at 22:39, Cody Koeninger  wrote:
> >
> > The only other valid use of messageHandler that I can think of is
> > catching serialization problems on a per-message basis.  But with the
> > new Kafka consumer library, that doesn't seem feasible anyway, and
> > could be handled with a custom (de)serializer.
>
> What do you mean, that doesn't seem feasible? You mean when using a custom
> deserializer? Right now I'm catching serialization problems in the message
> handler, after your proposed change I'd catch them in `map()`.
>
>
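A minimal sketch (Scala) of the approach described above, written against the
new Kafka consumer's Deserializer interface: catch per-record failures inside
the deserializer and surface them as None, then drop them in a later map() or
flatMap(). The class name and charset choice are illustrative assumptions, not
an existing Spark or Kafka API beyond the Deserializer interface itself.

import java.util.{Map => JMap}
import org.apache.kafka.common.serialization.Deserializer

class SafeStringDeserializer extends Deserializer[Option[String]] {
  override def configure(configs: JMap[String, _], isKey: Boolean): Unit = ()
  // A record that fails to deserialize becomes None instead of failing the whole batch.
  override def deserialize(topic: String, data: Array[Byte]): Option[String] =
    try Option(data).map(bytes => new String(bytes, "UTF-8"))
    catch { case _: Exception => None }
  override def close(): Unit = ()
}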


Re: Spark structured streaming

2016-03-08 Thread Praveen Devarao
Thanks Jacek for the pointer.

Any idea which package can be used in .format()? The test cases seem to
work out of the DefaultSource class defined within the
DataFrameReaderWriterSuite
[org.apache.spark.sql.streaming.test.DefaultSource]

Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:   Jacek Laskowski 
To: Praveen Devarao/India/IBM@IBMIN
Cc: user , dev 
Date:   08/03/2016 04:17 pm
Subject: Re: Spark structured streaming



Hi Praveen,

I've spent few hours on the changes related to streaming dataframes
(included in the SPARK-8360) and concluded that it's currently only
possible to read.stream(), but not write.stream() since there are no
streaming Sinks yet.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao  
wrote:
> Hi,
>
> I would like to get my hands on the structured streaming feature
> coming out in Spark 2.0. I have tried looking around for code samples to get
> started but am not able to find any. The only things I could look into are
> the test cases that have been committed under the JIRA umbrella
> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
> lead to building example code, as they seem to be working out of internal
> classes.
>
> Could anyone point me to some resources or pointers in code that I
> can start with to understand structured streaming from a consumability
> angle.
>
> Thanking You
> 
-
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
> 
-
> "Courage doesn't always roar. Sometimes courage is the quiet voice at 
the
> end of the day saying I will try again"

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen,

I've spent few hours on the changes related to streaming dataframes
(included in the SPARK-8360) and concluded that it's currently only
possible to read.stream(), but not write.stream() since there are no
streaming Sinks yet.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao  wrote:
> Hi,
>
> I would like to get my hands on the structured streaming feature
> coming out in Spark 2.0. I have tried looking around for code samples to get
> started but am not able to find any. The only things I could look into are
> the test cases that have been committed under the JIRA umbrella
> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
> lead to building example code, as they seem to be working out of internal
> classes.
>
> Could anyone point me to some resources or pointers in code that I
> can start with to understand structured streaming from a consumability
> angle.
>
> Thanking You
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark structured streaming

2016-03-08 Thread Praveen Devarao
Hi,

I would like to get my hands on the structured streaming feature 
coming out in Spark 2.0. I have tried looking around for code samples to 
get started but am not able to find any. The only things I could look into
are the test cases that have been committed under the JIRA umbrella
https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
lead to building example code, as they seem to be working out of internal
classes.

Could anyone point me to some resources or pointers in code that I 
can start with to understand structured streaming from a consumability 
angle.

Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



Re: ML ALS API

2016-03-08 Thread Nick Pentreath
Hi Maciej

Yes, that *train* method is intended to be public, but it is marked as
*DeveloperApi*, which means that backward compatibility is not necessarily
guaranteed, and that method may change. Having said that, even APIs marked
as DeveloperApi do tend to be relatively stable.

As the comment mentions:

 * :: DeveloperApi ::
 * An implementation of ALS that supports *generic ID types*, specialized
for Int and Long. This is
 * exposed as a developer API for users who do need other ID types. But it
is not recommended
 * because it increases the shuffle size and memory requirement during
training.

This *train* method is intended for the use case where user and item ids
are not the default Int (e.g. String). As you can see it returns the factor
RDDs directly, as opposed to an ALSModel instance, so overall it is a
little less user-friendly.

The *Float* ratings are to save space and make ALS more efficient overall.
That will not change in 2.0+ (especially since the precision of ratings is
not very important).

Hope that helps.

On Tue, 8 Mar 2016 at 08:20 Maciej Szymkiewicz 
wrote:

> Can I ask for a clarification regarding ml.recommendation.ALS:
>
> - Is the train method
> (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L598
> )
> intended to be public?
> - The Rating class
> (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L436
> )
> is using float instead of double, unlike its MLlib counterpart. Is it going
> to be the default encoding in 2.0+?
>
> --
> Best,
> Maciej Szymkiewicz
>
>
>
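To make the generic-ID path discussed above concrete, here is a hedged sketch
of calling the DeveloperApi train method with String ids. The parameter names
and defaults are recalled from the current API and may differ slightly between
versions, and the sample data is made up; treat this as illustrative only.

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating

// Assumes an existing SparkContext named sc (e.g. in spark-shell).
val ratings = sc.parallelize(Seq(
  Rating("userA", "item1", 4.0f),   // note: ratings are Float in the ml API
  Rating("userB", "item2", 3.5f)
))

// Returns the user and item factor RDDs directly rather than an ALSModel.
val (userFactors, itemFactors) = ALS.train(ratings, rank = 10, maxIter = 5)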


Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-08 Thread Dongjoon Hyun
Hi, I updated PR https://github.com/apache/spark/pull/11567.

But, `lint-java` fails if that file is in the dev folder. (Jenkins fails,
too.)

So, inevitably, I changed pom.xml instead.

Dongjoon.


On Mon, Mar 7, 2016 at 11:40 PM, Jacek Laskowski  wrote:

> Hi,
>
> At first glance it appears the commit *yesterday* (Warsaw time) broke
> the build :(
>
>
> https://github.com/apache/spark/commit/0eea12a3d956b54bbbd73d21b296868852a04494
>
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Mar 8, 2016 at 8:38 AM, Jacek Laskowski  wrote:
> > Hi,
> >
> > Got the BUILD FAILURE. Anyone looking into it?
> >
> > ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> > -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean
> > install
> > ...
> > [INFO]
> 
> > [INFO] BUILD FAILURE
> > [INFO]
> 
> > [INFO] Total time: 2.837 s
> > [INFO] Finished at: 2016-03-08T08:19:36+01:00
> > [INFO] Final Memory: 50M/581M
> > [INFO]
> 
> > [ERROR] Failed to execute goal
> > org.scalastyle:scalastyle-maven-plugin:0.8.0:check (default) on
> > project spark-parent_2.11: Failed during scalastyle execution: Unable
> > to find configuration file at location dev/scalastyle-config.xml ->
> > [Help 1]
> > [ERROR]
> > [ERROR] To see the full stack trace of the errors, re-run Maven with
> > the -e switch.
> > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> > [ERROR]
> > [ERROR] For more information about the errors and possible solutions,
> > please read the following articles:
> > [ERROR] [Help 1]
> > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-08 Thread Jacek Laskowski
Hi,

At first glance it appears the commit *yesterday* (Warsaw time) broke
the build :(

https://github.com/apache/spark/commit/0eea12a3d956b54bbbd73d21b296868852a04494


Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Mar 8, 2016 at 8:38 AM, Jacek Laskowski  wrote:
> Hi,
>
> Got the BUILD FAILURE. Anyone looking into it?
>
> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean
> install
> ...
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 2.837 s
> [INFO] Finished at: 2016-03-08T08:19:36+01:00
> [INFO] Final Memory: 50M/581M
> [INFO] 
> 
> [ERROR] Failed to execute goal
> org.scalastyle:scalastyle-maven-plugin:0.8.0:check (default) on
> project spark-parent_2.11: Failed during scalastyle execution: Unable
> to find configuration file at location dev/scalastyle-config.xml ->
> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with
> the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org