Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
cc'ing dev list

Ok, looks like when the KCL version was updated in
https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
probably leading to a dependency conflict, though as Burak mentions it's hard
to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
and on my 1.5.2 EC2 cluster: no data is received, and nothing shows up in
the driver or worker logs, so any exception is getting swallowed somewhere.
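For anyone wanting to reproduce this, a minimal consumer along these lines is
enough to see the symptom (a sketch against the 1.5 KinesisUtils API; the app
name, stream name, endpoint and region are placeholders):

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object KinesisRepro {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-repro"), Seconds(2))

    // Placeholder app/stream names, endpoint and region: substitute your own.
    val stream = KinesisUtils.createStream(
      ssc, "kinesis-repro-app", "test-stream",
      "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1",
      InitialPositionInStream.LATEST, Seconds(2),
      StorageLevel.MEMORY_AND_DISK_2)

    // On a working build this prints per-batch record counts;
    // against the broken 1.5.2 artifacts nothing ever arrives.
    stream.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```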

Run starting. Expected test count is: 4
KinesisStreamSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- KinesisUtils API
- RDD generation
- basic operation *** FAILED ***
  The code passed to eventually never returned normally. Attempted 13 times
over 2.04 minutes. Last failure message: Set() did not equal Set(5, 10,
1, 6, 9, 2, 7, 3, 8, 4)
  Data received does not match data sent. (KinesisStreamSuite.scala:188)
- failure recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 63 times
over 2.02863831 minutes. Last failure message: isCheckpointPresent
was true, but 0 was not greater than 10. (KinesisStreamSuite.scala:228)
Run completed in 5 minutes, 0 seconds.
Total number of tests run: 4
Suites: completed 1, aborted 0
Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------


KCL 1.3.0 depends on the *1.9.37* SDK (
https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
while the Spark Kinesis dependency was kept at *1.9.16*.
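Until a fixed release is out, applications building against 1.5.2 can pin the
SDK to the version KCL 1.3.0 expects themselves. A minimal sketch for an sbt
build (Maven users would set the same version on an explicit aws-java-sdk
dependency):

```scala
// build.sbt: force aws-java-sdk to the version KCL 1.3.0 is built against,
// overriding the 1.9.16 that spark-streaming-kinesis-asl 1.5.2 would
// otherwise pull in
dependencyOverrides += "com.amazonaws" % "aws-java-sdk" % "1.9.37"
```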

I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
1.9.37 and everything works.

Run starting. Expected test count is: 28
KinesisBackedBlockRDDSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- Basic reading from Kinesis
- Read data available in both block manager and Kinesis
- Read data available only in block manager, not in Kinesis
- Read data available only in Kinesis, not in block manager
- Read data available partially in block manager, rest in Kinesis
- Test isBlockValid skips block fetching from block manager
- Test whether RDD is valid after removing blocks from block manager
KinesisStreamSuite:
- KinesisUtils API
- RDD generation
- basic operation
- failure recovery
KinesisReceiverSuite:
- check serializability of SerializableAWSCredentials
- process records including store and checkpoint
- shouldn't store and checkpoint when receiver is stopped
- shouldn't checkpoint when exception occurs during store
- should set checkpoint time to currentTime + checkpoint interval upon
instantiation
- should checkpoint if we have exceeded the checkpoint interval
- shouldn't checkpoint if we have not exceeded the checkpoint interval
- should add to time when advancing checkpoint
- shutdown should checkpoint if the reason is TERMINATE
- shutdown should not checkpoint if the reason is something other than
TERMINATE
- retry success on first attempt
- retry success on second attempt after a Kinesis throttling exception
- retry success on second attempt after a Kinesis dependency exception
- retry failed after a shutdown exception
- retry failed after an invalid state exception
- retry failed after unexpected exception
- retry failed after exhausting all retries
Run completed in 3 minutes, 28 seconds.
Total number of tests run: 28
Suites: completed 4, aborted 0
Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

So this is a regression in the Spark Streaming Kinesis module in 1.5.2 -
@Brian, can you file a JIRA for this?

@dev-list, since KCL brings in the AWS SDK dependencies itself, is it necessary
to declare an explicit dependency on aws-java-sdk in the Kinesis POM? Also,
from KCL 1.5.0+, only the relevant components of the AWS SDK are brought in,
making things a bit leaner (this could perhaps be upgraded in Spark 1.7/2.0).
All local tests (and integration tests) pass with the explicit dependency
removed, depending only on KCL. Is aws-java-sdk used anywhere else? AFAIK it
is not, but in case I missed something, let me know of any good reason to
keep the explicit dependency.
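In other words, the module (and likewise any consumer) could declare only the
KCL, along these lines (an sbt-style sketch of the idea; the Spark POM itself
is Maven, but the coordinates are the same):

```scala
// Sketch: depend only on the KCL and let its matching aws-java-sdk
// (1.9.37 for KCL 1.3.0) arrive transitively. From KCL 1.5.0+ only the
// per-service SDK components actually used (Kinesis, DynamoDB, CloudWatch)
// are pulled in, rather than the whole aws-java-sdk bundle.
libraryDependencies += "com.amazonaws" % "amazon-kinesis-client" % "1.3.0"
```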

N



On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath 
wrote:

> Yeah, also the integration tests need to be run explicitly - I would have
> thought the contributor would have run those tests, and also tested the
> change themselves against live Kinesis :(
>
> —
> Sent from Mailbox 
>
>
> On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz  wrote:
>
>> I don't think the Kinesis tests specifically ran when that was merged
>> into 1.5.2 :(
>> https://github.com/apache/spark/pull/8957
>>
>> https://github.com/apache/spark/commit/883bd8fccf83aae7a2a847c9a6ca129fac86e6a3
>>
>> AFAIK pom changes don't trigger 

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
Is that PR against the master branch?




S3 reads come from Hadoop / JetS3t AFAIK.



—
Sent from Mailbox

On Fri, Dec 11, 2015 at 5:38 PM, Brian London 
wrote:

> That's good news. I've got a PR in to bump the SDK version to 1.10.40 and
> the KCL to 1.6.1, which I'm running tests on locally now.
> Is the AWS SDK not used for reading/writing from S3, or do we get that for
> free from the Hadoop dependencies?
> On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath 
> wrote:
>> [snip]

Re: JIRA: Wrong dates from imported JIRAs

2015-12-11 Thread Lars Francke
That's a good point. I assume there's always a small risk, but it's at least
the documented way from Atlassian to change the creation date, so I'd hope it
should be okay. I'd build the minimal CSV file.
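For illustration, the file would be roughly this shape (a hypothetical sketch
based on the Atlassian KB article linked below; the issue keys and dates here
are made up, and the actual column names and date format get chosen in the
import wizard's field mapping):

```scala
// Hypothetical sketch: one row per affected issue, mapping the issue key
// to its corrected creation date.
val header = "Issue Key,Created"
val rows = Seq(
  "SPARK-100,21/Aug/2012 09:03",
  "SPARK-101,22/Aug/2012 14:10"
)
println((header +: rows).mkString("\n"))
```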

I agree that probably not a lot of people are going to search across
projects, but on the other hand it's a one-time fix, and who knows how long
the Apache JIRA is going to live :)

On Fri, Dec 11, 2015 at 9:05 AM, Reynold Xin  wrote:

> Thanks for looking at this. Is it worth fixing? Is there a risk (although
> small) that the re-import would break other things?
>
> Most of those are done and I don't know how often people search JIRAs by
> date across projects.
>
> On Fri, Dec 11, 2015 at 3:40 PM, Lars Francke 
> wrote:
>
>> [snip]
>
>


Re: Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
I noticed that it is configurable at the job level via spark.task.cpus. Any
way to support it at the task level?
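For context, a minimal sketch of the job-level knob as it stands (the config
name is real; the threading inside the task is just one illustrative pattern):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

object TaskCpusDemo {
  def main(args: Array[String]): Unit = {
    // Job-level: EVERY task in this application reserves 4 cores, so the
    // scheduler runs fewer concurrent tasks per executor. There is no
    // per-task or per-stage override, which is what the question is about.
    val conf = new SparkConf().setAppName("heavy-tasks").set("spark.task.cpus", "4")
    val sc = new SparkContext(conf)

    val result = sc.parallelize(1 to 100, 8).mapPartitions { iter =>
      // Use the reserved cores with an explicit thread pool inside the task.
      val pool = Executors.newFixedThreadPool(4)
      implicit val ec = ExecutionContext.fromExecutorService(pool)
      val futures = iter.map(x => Future(x * x)).toList // force submission
      val out = futures.map(f => Await.result(f, Duration.Inf))
      pool.shutdown()
      out.iterator
    }.collect()

    println(result.take(5).mkString(","))
    sc.stop()
  }
}
```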

Thanks.

Zhan Zhang


On Dec 11, 2015, at 10:46 AM, Zhan Zhang  wrote:

> [snip]





Re: JIRA: Wrong dates from imported JIRAs

2015-12-11 Thread Reynold Xin
Thanks for looking at this. Is it worth fixing? Is there a risk (although
small) that the re-import would break other things?

Most of those are done and I don't know how often people search JIRAs by
date across projects.

On Fri, Dec 11, 2015 at 3:40 PM, Lars Francke 
wrote:

> Hi,
>
> I've been digging into JIRA a bit and found a couple of old issues (~250)
> and I just assume that they are all from the old JIRA.
>
> Here's one example:
>
> Old: 
> New: 
>
> created": "0012-08-21T09:03:00.000-0800",
>
> That's quite impressive but wrong :)
>
> That means when you sort all Apache JIRAs by creation date Spark comes
> first: <
> https://issues.apache.org/jira/issues/?jql=order%20By%20createdDate%20ASC=250
> >
>
> The dates were already wrong in the source JIRA.
>
> Now it seems as if those can be fixed using a CSV import. I still remember
> how painful the initial import was, but this looks relatively
> straightforward <
> https://confluence.atlassian.com/display/JIRAKB/How+to+change+the+issue+creation+date+using+CSV+import
> >
>
> If everyone's okay with it, I'd raise it with INFRA (and would prepare the
> necessary CSV file), but as I'm not a committer it'd be great if one/some of
> the committers could give me a +1.
>
> Cheers,
> Lars
>


A very Minor typo in the Spark paper

2015-12-11 Thread Fengdong Yu
Hi,

I found a very minor typo in:
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf


Page 4:
We complement the data mining example in Section 2.2.1 with two iterative
applications: logistic regression and PageRank.


I went back to Section 2.2.1, and those two examples are not there. It should
actually be Sections 3.2.1 and 3.2.2.

@Matei


Thanks






Re: coalesce at DataFrame missing argument for shuffle.

2015-12-11 Thread Reynold Xin
I am not sure if we need it. The RDD API has way too many methods and
parameters. As you said, it is simply "repartition".


On Fri, Dec 11, 2015 at 2:56 PM, Hyukjin Kwon  wrote:

> Hi all,
>
> I happened upon the coalesce() function and found that it takes different
> arguments for RDD and DataFrame.
>
> It looks like the shuffle option is missing for DataFrame.
>
> I understand that repartition() works exactly like coalesce() with
> shuffling, but it looks a bit weird that the same function takes different
> arguments when the two could be unified just by adding a single argument.
>
> Could anybody tell me whether this is intentionally missing or not?
>
> Thanks!
>
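For reference, the asymmetry being described, as a small sketch against the
1.5-era APIs:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CoalesceDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-demo").setMaster("local[4]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd = sc.parallelize(1 to 1000, 100)
    rdd.coalesce(10)                  // RDD API: narrow coalesce, no shuffle
    rdd.coalesce(10, shuffle = true)  // same method, shuffle flag flipped

    val df = rdd.toDF("n")
    df.coalesce(10)    // DataFrame API: no shuffle parameter here...
    df.repartition(10) // ...the shuffling form is a separate method

    sc.stop()
  }
}
```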


Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
Hi Folks,

Is it possible to assign multiple cores per task, and how? Suppose we have a
scenario in which some tasks do really heavy processing of each record and
require multi-threading, and we want to avoid similar tasks being assigned to
the same executors/hosts.

If it is not supported, does it make sense to add this feature? It may seem to
make users worry about more configuration, but by default we can still do 1
core per task, and only advanced users need to be aware of this feature.

Thanks.

Zhan Zhang




Maven build against Hadoop 2.4 times out

2015-12-11 Thread Ted Yu
Hi,
You may have noticed that the Maven build against Hadoop 2.4 times out on
Jenkins.

The last module to build is spark-hive-thriftserver.

This seems to have started with build #4440.

FYI 