[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-10-03 Thread QuentinAmbard
Github user QuentinAmbard commented on the issue:

https://github.com/apache/spark/pull/21917
  
SPARK-25005 has actually a far better solution to detect message lost. Will 
try to apply same logic...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-06 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
Recursively creating a Kafka RDD during creation of a Kafka RDD would need
a base case, but yeah, some way to have appropriate preferred locations.

On Mon, Aug 6, 2018 at 2:58 AM, Quentin Ambard 
wrote:

> *@QuentinAmbard* commented on this pull request.
> --
>
> In external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/
> DirectKafkaInputDStream.scala
> :
>
> > -  val fo = currentOffsets(tp)
> -  OffsetRange(tp.topic, tp.partition, fo, uo)
> +  /**
> +   * Return the offset range. For non consecutive offset the last offset 
must have record.
> +   * If offsets have missing data (transaction marker or abort), 
increases the
> +   * range until we get the requested number of record or no more 
records.
> +   * Because we have to iterate over all the records in this case,
> +   * we also return the total number of records.
> +   * @param offsets the target range we would like if offset were 
continue
> +   * @return (totalNumberOfRecords, updated offset)
> +   */
> +  private def alignRanges(offsets: Map[TopicPartition, Long]): 
Iterable[OffsetRange] = {
> +if (nonConsecutive) {
> +  val localRw = rewinder()
> +  val localOffsets = currentOffsets
> +  
context.sparkContext.parallelize(offsets.toList).mapPartitions(tpos => {
>
> Are you suggesting I should create a new kafkaRDD instead, and consume
> from this RDD to get the last offset range?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> , or 
mute
> the thread
> 

> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-06 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
Example report of skipped offsets in a non-compacted non-transactional
situation


http://mail-archives.apache.org/mod_mbox/kafka-users/201801.mbox/%3ccakwx9vxc1cdosqwwwjk3qmyy3svvtmh+rjdrjyvsbejsds8...@mail.gmail.com%3EFo

I asked on the kafka list about ways to tell if an offset is a
transactional marker.  I also asked about endOffset alternatives, although
I think that doesn't totally solve the problem (for instance, in cases
where the batch size has been rate limited)

On Mon, Aug 6, 2018 at 2:57 AM, Quentin Ambard 
wrote:

> By failed, you mean returned an empty collection after timing out, even
> though records should be available? You don't. You also don't know that it
> isn't just lost because kafka skipped a message. AFAIK from the 
information
> you have from a kafka consumer, once you start allowing gaps in offsets,
> you don't know.
>
> Ok that's interesting, my understanding was that if you successfully poll
> and get results you are 100% sure that you don't lose anything. Do you 
have
> more details on that? Why would kafka skip a record while consuming?
>
> Have you tested comparing the results of consumer.endOffsets for consumers
> with different isolation levels?
>
> endOffsets returns the last offset (same as seekToEnd). But you're right
> that the easiest solution for us would be to have something like
> seekToLastRecord method instead. Maybe something we could also ask ?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> , or 
mute
> the thread
> 

> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-06 Thread QuentinAmbard
Github user QuentinAmbard commented on the issue:

https://github.com/apache/spark/pull/21917
  
> By failed, you mean returned an empty collection after timing out, even 
though records should be available? You don't. You also don't know that it 
isn't just lost because kafka skipped a message. AFAIK from the information you 
have from a kafka consumer, once you start allowing gaps in offsets, you don't 
know.

Ok that's interesting, my understanding was that if you successfully poll 
and get results you are 100% sure that you don't lose anything.  Do you have 
more details on that? Why would kafka skip a record while consuming?
 
> Have you tested comparing the results of consumer.endOffsets for 
consumers with different isolation levels?

endOffsets returns the last offset (same as seekToEnd). But you're right 
that the easiest solution for us would be to have something like 
seekToLastRecord method instead. Maybe something we could also ask ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
> How do you know that offset 4 isn't just lost because poll failed?

By failed, you mean returned an empty collection after timing out, even 
though records should be available?  You don't.  You also don't know that it 
isn't just lost because kafka skipped a message.  AFAIK from the information 
you have from a kafka consumer, once you start allowing gaps in offsets, you 
don't know.

I understand your point, but even under your proposal you have no guarantee 
that the poll won't work in your first pass during RDD construction, and then 
fail on the executor during computation, right?

> The issue with your proposal is that SeekToEnd gives you the last offset 
which might not be the last record.

Have you tested comparing the results of consumer.endOffsets for consumers 
with different isolation levels?

Your proposal might end up being the best approach anyway, just because of 
the unfortunate effect of StreamInputInfo and count, but I want to make sure we 
think this through.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread QuentinAmbard
Github user QuentinAmbard commented on the issue:

https://github.com/apache/spark/pull/21917
  
If you are doing it in advance you'll change the range, so for example you 
read until 3 and don't get any extra results. Maybe it's because of a 
transaction offset, maybe another issue, it's ok in both cases.
The big difference is that the next batch will restart from offset 3 and 
poll from this value. If seek to 3 and poll get you another result (for example 
6) then everything is fine  it's not a data loss it's just a gap.
The issue with your proposal is that SeekToEnd gives you the last offset 
which might not be the last record.
So in your example if last offset is 5 and after a few poll the last record 
you get is 3 what do you do, continue and execute the next batch from 5? How do 
you know that offset 4 isn't just lost because poll failed?
The only way to know that would be to get a record with an offset higher 
than 5. In this case you know it's just a gap. 
But if the message you are reading is the last of the topic you won't have 
records higher than 3, do you can't tell if it's a poll failure or an empty 
offset because of the transaction commit


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
If the last offset in the range as calculated by the driver is 5, and on 
the executor all you can poll up to after a repeated attempt is 3, and the user 
already told you to allowNonConsecutiveOffsets... then you're done, no error.

Why does it matter if you do this logic when you're reading all the 
messages in advance and counting, or when you're actually computing? 

To put it another way, this PR is a lot of code change and refactoring, why 
not just change the logic of e.g. how CompactedKafkaRDDIterator interacts with 
compactedNext?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread QuentinAmbard
Github user QuentinAmbard commented on the issue:

https://github.com/apache/spark/pull/21917
  
I'm not sure to understand your point. The cause of the gap doesn't matter, 
we just want to stop on an existing offset to be able to poll it. It can be 
because of a transaction marker, a transaction abort or even just a temporary 
poll failure it's not relevant in this case.
The driver is smart enough to be able to restart from any Offset, even in 
the middle of a transaction (abort or not)
The issue with gap at the end is that you can't know if it's a gap or if 
the poll failed. 
For example SeekToEnd gives you 5 but the last record you get is 3 and 
there is no way to know if 4 is missing or just an offset gap.
How could we fix that in a different way?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
Still playing devil's advocate here, I don't think stopping at 3 in your 
example actually tells you anything about the cause of the gaps in the sequence 
at 4.  I'm not sure you can know that the gap is because of a transaction 
marker, without a modified kafka consumer library.

If the actual problem is that when allowNonConsecutiveOffsets is set we 
need to allow gaps even at the end of an offset range... why not just fix that 
directly?

Master is updated to kafka 2.0 at this point, so we should be able to write 
a test for your original jira example of a partition consisting of 1 message 
followed by 1 transaction commit.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-03 Thread QuentinAmbard
Github user QuentinAmbard commented on the issue:

https://github.com/apache/spark/pull/21917
  
With this solution we don't read the data another time "just to support 
transaction."
The current implementation of compacted topics already ready all the 
messages twice in order to get a correct count for the info tracker: `val 
inputInfo = StreamInputInfo(id, rdd.count, metadata)`
Since RDD are compacted or have "empty offset" due to transactions 
markers/abort, the only way to get the exact count is to read.
The same logic applies to select a range: if you want to get N records per 
partition, the only way to know what "untilOffset" will make you read N records 
is to read, stop when you've read N records, and get the offset.
So one of the advantage is to be able to fetch the number of records per 
partition you really want (for compacted topics and transactions).
But the real advantage is that it let you pick an "untilOffset" that is not 
empty.
For example if you don't have record for offset [4, 6]

```1, 2, 3, 4, 5, 6
a, b, c,   ,   ,  ,```

if use an offset range of [1,5), you will try to read 4 but won't receive 
any data. In this scenario you can't tell if data is missing (so it's ok) or 
you lose some data because kafka is down (not ok)

To deal with this situation, we first scan the offset and stop to the last 
offset where we had data, in the example instead of [1,5) we would go with 
[1,4) because 3 has data so it's safe to stop at 3
During the next batch, if we then get extra data we would then select the 
next range as [4, 8) and won't have any issue.
```1, 2, 3, 4, 5, 6, 7
a, b, c,   ,   ,  , g```

does that make sense?
(PS: sorry for the multiple commit, I wasn't running Mima properly, I'll 
fix the remaining issue soon)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94058/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21917
  
**[Test build #94058 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94058/testReport)**
 for PR 21917 at commit 
[`69582f4`](https://github.com/apache/spark/commit/69582f46cb8c7f7285c42b86d06bf475a43a3856).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94056/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21917
  
**[Test build #94056 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94056/testReport)**
 for PR 21917 at commit 
[`29c5406`](https://github.com/apache/spark/commit/29c5406be32c112c11a6552b06925d3541c7d472).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21917
  
**[Test build #94058 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94058/testReport)**
 for PR 21917 at commit 
[`69582f4`](https://github.com/apache/spark/commit/69582f46cb8c7f7285c42b86d06bf475a43a3856).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94055/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21917
  
**[Test build #94055 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94055/testReport)**
 for PR 21917 at commit 
[`70ecd38`](https://github.com/apache/spark/commit/70ecd38e05073c6713855413512fe6a9452f2994).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21917
  
**[Test build #94056 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94056/testReport)**
 for PR 21917 at commit 
[`29c5406`](https://github.com/apache/spark/commit/29c5406be32c112c11a6552b06925d3541c7d472).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21917
  
**[Test build #94055 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94055/testReport)**
 for PR 21917 at commit 
[`70ecd38`](https://github.com/apache/spark/commit/70ecd38e05073c6713855413512fe6a9452f2994).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93803/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21917
  
**[Test build #93803 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93803/testReport)**
 for PR 21917 at commit 
[`05c7e7f`](https://github.com/apache/spark/commit/05c7e7fb96806c07bc9b0513ef59fbcdd5ae9118).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21917
  
**[Test build #93803 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93803/testReport)**
 for PR 21917 at commit 
[`05c7e7f`](https://github.com/apache/spark/commit/05c7e7fb96806c07bc9b0513ef59fbcdd5ae9118).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
jenkins, ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21917
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-07-30 Thread holdensmagicalunicorn
Github user holdensmagicalunicorn commented on the issue:

https://github.com/apache/spark/pull/21917
  
@QuentinAmbard, thanks! I am a bot who has found some folks who might be 
able to help with the review:@tdas, @zsxwing and @koeninger


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org