[GitHub] clintropolis opened a new pull request #6107: Order rows during incremental index persist when rollup is disabled.
clintropolis opened a new pull request #6107: Order rows during incremental index persist when rollup is disabled. URL: https://github.com/apache/incubator-druid/pull/6107

Resolves #6066 by modifying the `FactsHolder` interface to include a new method `Iterable getPersistIterable()` and using this when persisting incremental indexes. Added an additional benchmark generator schema with 4 low-cardinality dimensions to enable testing this scenario.

Before this patch:
```
Benchmark                        (rollup) (rollupOpportunity) (rowsPerSegment) (schema) Mode Cnt      Score      Error Units
IndexPersistBenchmark.persistV9      true                none            75000    rollo avgt  25 429409.821 ± 17771.526 us/op
IndexPersistBenchmark.persistV9      true            moderate            75000    rollo avgt  25  57578.929 ±  2650.596 us/op
IndexPersistBenchmark.persistV9      true                high            75000    rollo avgt  25  11023.976 ±   461.142 us/op
IndexPersistBenchmark.persistV9     false                none            75000    rollo avgt  25 414289.365 ± 16384.902 us/op
IndexPersistBenchmark.persistV9     false            moderate            75000    rollo avgt  25 407060.720 ± 16965.695 us/op
IndexPersistBenchmark.persistV9     false                high            75000    rollo avgt  25     48.825 ± 19613.728 us/op

size [2262258] bytes.
size [276631] bytes.
size [47597] bytes.
size [2280590] bytes.
size [2095354] bytes.
size [2094972] bytes.
```

After:
```
Benchmark                        (rollup) (rollupOpportunity) (rowsPerSegment) (schema) Mode Cnt      Score      Error Units
IndexPersistBenchmark.persistV9      true                none            75000    rollo avgt  25 436966.463 ± 45936.358 us/op
IndexPersistBenchmark.persistV9      true            moderate            75000    rollo avgt  25  54724.237 ±  7500.566 us/op
IndexPersistBenchmark.persistV9      true                high            75000    rollo avgt  25  11010.033 ±   718.345 us/op
IndexPersistBenchmark.persistV9     false                none            75000    rollo avgt  25 464730.668 ± 30413.613 us/op
IndexPersistBenchmark.persistV9     false            moderate            75000    rollo avgt  25 523597.179 ± 43443.648 us/op
IndexPersistBenchmark.persistV9     false                high            75000    rollo avgt  25 535282.839 ± 46529.297 us/op

size [2262258] bytes.
size [276631] bytes.
size [47597] bytes.
size [2269144] bytes.
size [1475402] bytes.
size [1357298] bytes.
```

The actual difference in segment size will vary quite a lot from this contrived scenario, but segments should generally be smaller, at the cost of slower index persist time. Query performance should be unaffected. See #6066 for additional benchmarks and discussion.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
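The ordering idea in #6107 can be sketched in plain Java. All names here are illustrative stand-ins, not Druid's actual `FactsHolder` code: with rollup disabled, rows are kept in arrival order, so a persist-time iterable sorts a copy by timestamp and then by dimension values, placing identical adjacent values together so they compress better.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PersistOrdering
{
  // Illustrative stand-in for an incremental-index row: a timestamp plus dimension values.
  static class Row
  {
    final long timestamp;
    final List<String> dims;

    Row(long timestamp, List<String> dims)
    {
      this.timestamp = timestamp;
      this.dims = dims;
    }
  }

  // Compare on timestamp first, then on each dimension value in order,
  // mirroring the natural ordering rollup mode already maintains.
  static final Comparator<Row> PERSIST_ORDER = (a, b) -> {
    int cmp = Long.compare(a.timestamp, b.timestamp);
    if (cmp != 0) {
      return cmp;
    }
    int n = Math.min(a.dims.size(), b.dims.size());
    for (int i = 0; i < n; i++) {
      cmp = a.dims.get(i).compareTo(b.dims.get(i));
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.dims.size(), b.dims.size());
  };

  // Sort a copy of the arrival-ordered rows before persisting; the in-memory
  // index itself stays in insertion order, so query paths are untouched.
  static List<Row> persistIterable(List<Row> arrivalOrder)
  {
    List<Row> sorted = new ArrayList<>(arrivalOrder);
    sorted.sort(PERSIST_ORDER);
    return sorted;
  }
}
```

The extra sort only runs at persist time, which matches the slower persist scores and smaller segment sizes in the benchmark above.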
[GitHub] jihoonson opened a new pull request #6106: [Backport] Fix IllegalArgumentException in TaskLockBox.syncFromStorage() when updating from 0.12.x to 0.12.2
jihoonson opened a new pull request #6106: [Backport] Fix IllegalArgumentException in TaskLockBox.syncFromStorage() when updating from 0.12.x to 0.12.2 URL: https://github.com/apache/incubator-druid/pull/6106 Backport of #6086 to 0.12.2.
[GitHub] jihoonson commented on a change in pull request #6095: Add support 'keepSegmentGranularity' for compactionTask
jihoonson commented on a change in pull request #6095: Add support 'keepSegmentGranularity' for compactionTask URL: https://github.com/apache/incubator-druid/pull/6095#discussion_r207683180

## File path: indexing-service/src/main/java/io/druid/indexing/common/task/CompactionTask.java

```diff
@@ -263,34 +282,75 @@ static IndexIngestionSpec createIngestionSchema(
     final List> timelineSegments = pair.rhs;
     if (timelineSegments.size() == 0) {
-      return null;
+      return Collections.emptyList();
     }
-    final DataSchema dataSchema = createDataSchema(
-        segmentProvider.dataSource,
-        segmentProvider.interval,
-        dimensionsSpec,
-        toolbox.getIndexIO(),
-        jsonMapper,
-        timelineSegments,
-        segmentFileMap
-    );
-    return new IndexIngestionSpec(
-        dataSchema,
-        new IndexIOConfig(
-            new IngestSegmentFirehoseFactory(
-                segmentProvider.dataSource,
-                segmentProvider.interval,
-                null, // no filter
-                // set dimensions and metrics names to make sure that the generated dataSchema is used for the firehose
-                dataSchema.getParser().getParseSpec().getDimensionsSpec().getDimensionNames(),
-                Arrays.stream(dataSchema.getAggregators()).map(AggregatorFactory::getName).collect(Collectors.toList()),
-                toolbox.getIndexIO()
-            ),
-            false
-        ),
-        tuningConfig
-    );
+    if (keepSegmentGranularity) {
+      // if keepSegmentGranularity = true, create indexIngestionSpec per segment interval, so that we can run an index
+      // task per segment interval.
+      final List specs = new ArrayList<>(timelineSegments.size());
+      for (TimelineObjectHolder holder : timelineSegments) {
+        final DataSchema dataSchema = createDataSchema(
+            segmentProvider.dataSource,
+            holder.getInterval(),
+            Collections.singletonList(holder),
+            dimensionsSpec,
+            toolbox.getIndexIO(),
+            jsonMapper,
+            segmentFileMap
+        );
+
+        specs.add(
+            new IndexIngestionSpec(
+                dataSchema,
+                new IndexIOConfig(
+                    new IngestSegmentFirehoseFactory(
+                        segmentProvider.dataSource,
+                        holder.getInterval(),
+                        null, // no filter
+                        // set dimensions and metrics names to make sure that the generated dataSchema is used for the firehose
+                        dataSchema.getParser().getParseSpec().getDimensionsSpec().getDimensionNames(),
+                        Arrays.stream(dataSchema.getAggregators()).map(AggregatorFactory::getName).collect(Collectors.toList()),
+                        toolbox.getIndexIO()
+                    ),
+                    false
+                ),
+                tuningConfig
+            )
+        );
+      }
+
+      return specs;
+    } else {
+      final DataSchema dataSchema = createDataSchema(
+          segmentProvider.dataSource,
+          segmentProvider.interval,
+          timelineSegments,
+          dimensionsSpec,
+          toolbox.getIndexIO(),
+          jsonMapper,
+          segmentFileMap
+      );
+
+      return Collections.singletonList(
+          new IndexIngestionSpec(
+              dataSchema,
+              new IndexIOConfig(
```

Review comment: Sounds good. Fixed.
[GitHub] gianm commented on issue #5938: URL encode datasources, task ids, authenticator names.
gianm commented on issue #5938: URL encode datasources, task ids, authenticator names. URL: https://github.com/apache/incubator-druid/pull/5938#issuecomment-410389509 I've pushed a fix for the tests.
[GitHub] gianm commented on issue #1836: MapCache needs re-written
gianm commented on issue #1836: MapCache needs re-written URL: https://github.com/apache/incubator-druid/issues/1836#issuecomment-410388709 Yes.
[GitHub] gianm commented on issue #1835: Schema differences of GroupBy vs TopN/TimeSeries return data should be not be present
gianm commented on issue #1835: Schema differences of GroupBy vs TopN/TimeSeries return data should be not be present URL: https://github.com/apache/incubator-druid/issues/1835#issuecomment-410388362 Sure, let's close it won't-fix.
[GitHub] gianm closed issue #1835: Schema differences of GroupBy vs TopN/TimeSeries return data should be not be present
gianm closed issue #1835: Schema differences of GroupBy vs TopN/TimeSeries return data should be not be present URL: https://github.com/apache/incubator-druid/issues/1835
[GitHub] gianm closed issue #1746: Tasks that are waiting on locks are reporting as RUNNING
gianm closed issue #1746: Tasks that are waiting on locks are reporting as RUNNING URL: https://github.com/apache/incubator-druid/issues/1746
[GitHub] gianm commented on issue #1746: Tasks that are waiting on locks are reporting as RUNNING
gianm commented on issue #1746: Tasks that are waiting on locks are reporting as RUNNING URL: https://github.com/apache/incubator-druid/issues/1746#issuecomment-410388134 Ah yes. Now they are "WAITING" if they are waiting on locks.
[GitHub] gianm commented on issue #1671: Use isolated classloader for javascript related items
gianm commented on issue #1671: Use isolated classloader for javascript related items URL: https://github.com/apache/incubator-druid/issues/1671#issuecomment-410387273 I suppose it's valid as long as we support javascript at all, so we could leave it open. I am not sure if anyone has quantified the impact, so I am not sure how big of a deal this is.
[GitHub] gianm commented on issue #1449: Allow sorting by timestamp
gianm commented on issue #1449: Allow sorting by timestamp URL: https://github.com/apache/incubator-druid/issues/1449#issuecomment-410387085 Yes, it is. It became possible as part of the numeric dimensions feature.
[GitHub] gianm closed issue #1449: Allow sorting by timestamp
gianm closed issue #1449: Allow sorting by timestamp URL: https://github.com/apache/incubator-druid/issues/1449
[GitHub] gianm commented on issue #1204: Define granularities in UTC
gianm commented on issue #1204: Define granularities in UTC URL: https://github.com/apache/incubator-druid/issues/1204#issuecomment-410386867 I believe #4611 fixed this.
[GitHub] jon-wei closed pull request #6072: validate baseDataSource non-empty string in DerivativeDataSourceMetad…
jon-wei closed pull request #6072: validate baseDataSource non-empty string in DerivativeDataSourceMetad… URL: https://github.com/apache/incubator-druid/pull/6072

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

```diff
diff --git a/extensions-contrib/materialized-view-maintenance/src/main/java/io/druid/indexing/materializedview/DerivativeDataSourceMetadata.java b/extensions-contrib/materialized-view-maintenance/src/main/java/io/druid/indexing/materializedview/DerivativeDataSourceMetadata.java
index ed102e354e0..3a82672e14d 100644
--- a/extensions-contrib/materialized-view-maintenance/src/main/java/io/druid/indexing/materializedview/DerivativeDataSourceMetadata.java
+++ b/extensions-contrib/materialized-view-maintenance/src/main/java/io/druid/indexing/materializedview/DerivativeDataSourceMetadata.java
@@ -23,6 +23,7 @@
 import com.fasterxml.jackson.annotation.JsonProperty;
 import com.google.common.base.Preconditions;
 import com.google.common.collect.Sets;
+import com.google.common.base.Strings;
 import io.druid.indexing.overlord.DataSourceMetadata;

 import java.util.Objects;
@@ -41,7 +42,9 @@ public DerivativeDataSourceMetadata(
     @JsonProperty("metrics") Set metrics
 )
 {
-    this.baseDataSource = Preconditions.checkNotNull(baseDataSource, "baseDataSource cannot be null. This is not a valid DerivativeDataSourceMetadata.");
+    Preconditions.checkArgument(!Strings.isNullOrEmpty(baseDataSource), "baseDataSource cannot be null or empty. Please provide a baseDataSource.");
+    this.baseDataSource = baseDataSource;
+
     this.dimensions = Preconditions.checkNotNull(dimensions, "dimensions cannot be null. This is not a valid DerivativeDataSourceMetadata.");
     this.metrics = Preconditions.checkNotNull(metrics, "metrics cannot be null. This is not a valid DerivativeDataSourceMetadata.");
 }
diff --git a/extensions-contrib/materialized-view-maintenance/src/test/java/io/druid/indexing/materializedview/DerivativeDataSourceMetadataTest.java b/extensions-contrib/materialized-view-maintenance/src/test/java/io/druid/indexing/materializedview/DerivativeDataSourceMetadataTest.java
new file mode 100644
index 000..d62ea385d31
--- /dev/null
+++ b/extensions-contrib/materialized-view-maintenance/src/test/java/io/druid/indexing/materializedview/DerivativeDataSourceMetadataTest.java
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package io.druid.indexing.materializedview;
+
+import com.google.common.collect.Sets;
+import org.hamcrest.CoreMatchers;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+
+import java.util.Set;
+
+
+public class DerivativeDataSourceMetadataTest
+{
+  @Rule
+  public ExpectedException expectedException = ExpectedException.none();
+
+  @Test
+  public void testEmptyBaseDataSource() throws Exception
+  {
+    expectedException.expect(CoreMatchers.instanceOf(IllegalArgumentException.class));
+    expectedException.expectMessage(
+        "baseDataSource cannot be null or empty. Please provide a baseDataSource."
+    );
+    String baseDataSource = "";
+    Set dims = Sets.newHashSet("dim1", "dim2", "dim3");
+    Set metrics = Sets.newHashSet("cost");
+    DerivativeDataSourceMetadata metadata = new DerivativeDataSourceMetadata(baseDataSource, dims, metrics);
+  }
+
+  @Test
+  public void testNullBaseDataSource() throws Exception
+  {
+    expectedException.expect(CoreMatchers.instanceOf(IllegalArgumentException.class));
+    expectedException.expectMessage(
+        "baseDataSource cannot be null or empty. Please provide a baseDataSource."
+    );
+    String baseDataSource = null;
+    Set dims = Sets.newHashSet("dim1", "dim2", "dim3");
+    Set metrics = Sets.newHashSet("cost");
+    DerivativeDataSourceMetadata metadata = new DerivativeDataSourceMetadata(baseDataSource, dims, metrics);
+  }
+}
```
[GitHub] himanshug commented on issue #6066: Sorting rows when rollup is disabled
himanshug commented on issue #6066: Sorting rows when rollup is disabled URL: https://github.com/apache/incubator-druid/issues/6066#issuecomment-410326897 @clintropolis if you do end up further benchmarking sorting at persist time, you could also consider reordering the dimensions from low to high cardinality, since the cardinalities are known at that point, to potentially improve segment size.
[GitHub] leventov commented on a change in pull request #5957: Renamed GenericColumnSerializer to ColumnSerializer; 'Generic Column' -> 'Numeric Column'; Fixed a few resource leaks in processing; Fixe
leventov commented on a change in pull request #5957: Renamed GenericColumnSerializer to ColumnSerializer; 'Generic Column' -> 'Numeric Column'; Fixed a few resource leaks in processing; Fixed a bug in SingleStringInputDimensionSelector; misc refinements URL: https://github.com/apache/incubator-druid/pull/5957#discussion_r207604065 ## File path: benchmarks/src/main/java/io/druid/benchmark/package-info.java ## @@ -17,17 +17,7 @@ * under the License. */ -package io.druid.segment; +@EverythingIsNonnullByDefault +package io.druid.benchmark; -import io.druid.guice.annotations.ExtensionPoint; -import io.druid.segment.serde.Serializer; - -import java.io.IOException; - -@ExtensionPoint -public interface GenericColumnSerializer extends Serializer Review comment: Next release is a major release (0.12.x -> 0.13). I don't think we should fall into conservatism (not fixing things because users may depend on them) until Druid 1.0, and clean up APIs eagerly before that.
[GitHub] aoeiuvb opened a new issue #6102: "Cannot have same delimiter and list delimiter of \u0001"
aoeiuvb opened a new issue #6102: "Cannot have same delimiter and list delimiter of \u0001" URL: https://github.com/apache/incubator-druid/issues/6102 When I used Hadoop to ingest data, I used DelimitedParser to parse the rows. The delimiter I set was **\u0001**, but the following error was reported: "**Cannot have the same delimiter and list delimiter of \u0001**". I looked at the source code of DelimitedParser and found that in the constructor, the delimiter and listDelimiter are checked for equality, and if they are equal an error is thrown. I did not set the value of listDelimiter, whose default value is also **\u0001**, so this error was reported. To work around this, I explicitly set listDelimiter to a different value so that the program could proceed. May I ask if there is a problem with the design of this piece? https://user-images.githubusercontent.com/15936294/43644856-be5ec84a-9762-11e8-8cb1-899a8bb02098.png
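The collision described in #6102 can be sketched as follows. The `resolveListDelimiter` helper below is hypothetical, not the actual DelimitedParser code: when no list delimiter is given, a default of \u0001 is used, so also setting the column delimiter to \u0001 trips the equality check, and the workaround is to pass any other list delimiter explicitly.

```java
public class DelimiterCheck
{
  // Assumed default list delimiter, matching the issue's description.
  static final String DEFAULT_LIST_DELIMITER = "\u0001";

  // Mirrors the constructor check the issue describes: fall back to the default
  // list delimiter, then reject it if it collides with the column delimiter.
  static String resolveListDelimiter(String delimiter, String listDelimiter)
  {
    String resolved = listDelimiter != null ? listDelimiter : DEFAULT_LIST_DELIMITER;
    if (resolved.equals(delimiter)) {
      throw new IllegalArgumentException(
          "Cannot have same delimiter and list delimiter of [" + delimiter + "]"
      );
    }
    return resolved;
  }
}
```

With a column delimiter of \u0001, explicitly passing e.g. \u0002 as the list delimiter avoids the collision, which is exactly the workaround the reporter used.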
[GitHub] chengchengpei commented on issue #6072: validate baseDataSource non-empty string in DerivativeDataSourceMetad…
chengchengpei commented on issue #6072: validate baseDataSource non-empty string in DerivativeDataSourceMetad… URL: https://github.com/apache/incubator-druid/pull/6072#issuecomment-410242447 @jihoonson NP. How can I become a committer?
[GitHub] aryaflow commented on issue #5221: Support Hadoop batch ingestion for druid-azure-extensions
aryaflow commented on issue #5221: Support Hadoop batch ingestion for druid-azure-extensions URL: https://github.com/apache/incubator-druid/pull/5221#issuecomment-410241234 @hoesler hadoop-azure is not added by default. But it's needed.
[GitHub] bohemia420 commented on issue #5150: Druid-parquet-extensions fails on timestamps (stored as INT96) in parquet files
bohemia420 commented on issue #5150: Druid-parquet-extensions fails on timestamps (stored as INT96) in parquet files URL: https://github.com/apache/incubator-druid/issues/5150#issuecomment-410239779 Is this resolved? @amalakar how did you remove the avro converter? @gianm how does one change the parquet reading strategy?
[GitHub] aryaflow commented on issue #5221: Support Hadoop batch ingestion for druid-azure-extensions
aryaflow commented on issue #5221: Support Hadoop batch ingestion for druid-azure-extensions URL: https://github.com/apache/incubator-druid/pull/5221#issuecomment-410234843 Thanks for the feature. I tried this in a cluster and managed to make it work, but I had to add wasb handling to JobHelper.java as well:
```java
} else if ("wasbs".equals(type)) {
  segmentLocURI = URI.create(loadSpec.get("path").toString());
}
```
[GitHub] robertervin opened a new issue #6101: Druid InDimFilter Fails on Empty Values
robertervin opened a new issue #6101: Druid InDimFilter Fails on Empty Values URL: https://github.com/apache/incubator-druid/issues/6101

### Use Case
We dynamically build Druid queries by letting the user apply filters in the UI. Sometimes this results in the user filtering out all options (sometimes unintentionally), and that is something we want to allow.

### Problem
In https://github.com/apache/incubator-druid/blob/04ea3c9f8c1f5ea34b023217bd709509ace4d30d/processing/src/main/java/io/druid/query/filter/InDimFilter.java#L78 Druid fails when filtering on an empty value set, though I see no reason why it should, since the result should simply be empty. e.g. in pseudocode,
```
mydatasource.filter(mycolumn in []) = []
```
Is there a better reason for this, or can I submit a PR to remove that restriction?
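The behavior the issue asks for can be sketched with an illustrative predicate (hypothetical names, not Druid's actual InDimFilter): an IN filter over an empty value set is vacuously false, so the filtered result is simply empty rather than an error.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class InFilterSketch
{
  // An IN filter over an empty value set matches nothing; there is no
  // logical need to reject it, which is the issue's argument.
  static Predicate<String> inFilter(Set<String> values)
  {
    if (values.isEmpty()) {
      return v -> false; // vacuously false: "x in []" never holds
    }
    return values::contains;
  }

  static List<String> filterRows(List<String> rows, Set<String> values)
  {
    return rows.stream().filter(inFilter(values)).collect(Collectors.toList());
  }
}
```

Under this semantics, `filterRows(rows, Set.of())` yields an empty list, matching the pseudocode `mydatasource.filter(mycolumn in []) = []` in the issue.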
[GitHub] asdf2014 commented on issue #6099: Inconsistencies in the result of the quantile aggregator
asdf2014 commented on issue #6099: Inconsistencies in the result of the quantile aggregator URL: https://github.com/apache/incubator-druid/issues/6099#issuecomment-410201965 Hi, @jacktomcat. This result is obtained because the quantile aggregator uses an approximate algorithm called a `sketch`. You can find it in the Druid [doc](http://druid.io/docs/latest/development/extensions-core/approximate-histograms.html). A [paper](http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf) and a [blog](https://metamarkets.com/2013/histograms/) are also linked in the documentation if you want to study the algorithm in depth. BTW, there is a [table](https://datasketches.github.io/docs/Quantiles/QuantilesAccuracy.html) of Quantiles DoublesSketch size in bytes versus approximate error, which might help you.
[GitHub] dyanarose opened a new issue #6100: DOCS: ingestion spec documentation should be expanded
dyanarose opened a new issue #6100: DOCS: ingestion spec documentation should be expanded URL: https://github.com/apache/incubator-druid/issues/6100
- The dataSchema would be much clearer if an example of the raw data were included.
- The importance of the metrics spec would be clearer if there were a dedicated section explaining how it is used when querying, along with example queries made possible by the example metrics spec (some information is spread across the design document and the schema-design document, but the metricsSpec should have its own dedicated and complete documentation).

https://github.com/apache/incubator-druid/blob/master/docs/content/ingestion/index.md
[GitHub] asdf2014 commented on issue #6047: Minor change in the "Start up Druid services" Section
asdf2014 commented on issue #6047: Minor change in the "Start up Druid services" Section URL: https://github.com/apache/incubator-druid/pull/6047#issuecomment-410184869 Hi, @hpandeycodeit. I also encountered this situation. There is a [formula](http://druid.io/docs/latest/configuration/broker.html#processing) for it in the Druid doc. In my experience with this problem, you may need to add the `-XX:MaxDirectMemorySize` option to both the `conf/druid/historical/jvm.config` and `conf/druid/broker/jvm.config` files :sweat_smile:
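The sizing rule behind that suggestion can be sketched as simple arithmetic, assuming the formula from the linked Druid configuration doc: direct memory must cover roughly `druid.processing.buffer.sizeBytes * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1)`, which gives a lower bound for `-XX:MaxDirectMemorySize`.

```java
public class DirectMemorySizing
{
  // Lower bound on -XX:MaxDirectMemorySize per the Druid processing-buffer
  // formula: one buffer per processing thread, per merge buffer, plus one spare.
  static long requiredDirectMemoryBytes(long bufferSizeBytes, int numThreads, int numMergeBuffers)
  {
    return bufferSizeBytes * (numThreads + numMergeBuffers + 1);
  }
}
```

For example, with 512 MB buffers, 7 processing threads, and 2 merge buffers (illustrative values, not anyone's actual config), the bound is 10 buffers, i.e. 5 GB of direct memory.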
[GitHub] clintropolis commented on issue #6066: Sorting rows when rollup is disabled
clintropolis commented on issue #6066: Sorting rows when rollup is disabled URL: https://github.com/apache/incubator-druid/issues/6066#issuecomment-410183701 OK, I had to know, so I went ahead and ran benchmarks for the other approach, sorting at persist time. No rollup opportunity:
```
Benchmark                        (rollup)         (rowsPerSegment)  (schema)  Mode  Cnt       Score        Error  Units
IndexPersistBenchmark.persistV9  true                        75000     basic  avgt   25  499315.212 ± 154036.971  us/op
IndexPersistBenchmark.persistV9  false                       75000     basic  avgt   25  449792.742 ±  28218.504  us/op
IndexPersistBenchmark.persistV9  false (ordered)             75000     basic  avgt   25  508051.563 ±  63033.662  us/op

all sizes: [3038874] bytes.
```
moderate rollup opportunity:
```
Benchmark                        (rollup)         (rowsPerSegment)  (schema)  Mode  Cnt       Score       Error  Units
IndexPersistBenchmark.persistV9  true                        75000     basic  avgt   25  406840.576 ± 20732.769  us/op
IndexPersistBenchmark.persistV9  false                       75000     basic  avgt   25  431725.214 ± 18793.693  us/op
IndexPersistBenchmark.persistV9  false (ordered)             75000     basic  avgt   25  494056.572 ± 34396.770  us/op

rollup: size [2285574] bytes.
no-rollup: size [2741399] bytes.
ordered-no-rollup: size [2516639] bytes.
```
more rollup:
```
Benchmark                        (rollup)         (rowsPerSegment)  (schema)  Mode  Cnt       Score       Error  Units
IndexPersistBenchmark.persistV9  true                        75000     basic  avgt   25  338251.339 ± 22031.319  us/op
IndexPersistBenchmark.persistV9  false                       75000     basic  avgt   25  443272.327 ± 25099.425  us/op
IndexPersistBenchmark.persistV9  false (ordered)             75000     basic  avgt   25  552234.263 ± 41889.207  us/op

rollup: size [1755456] bytes.
no-rollup: size [2741017] bytes.
ordered-no-rollup: size [2346649] bytes.
```
I don't have strong feelings about the best way to do this; the persist performance cost looks to be in the range of 15-20% slower here. Maybe it is better to sort at persist time so as not to risk impacting query performance?
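The persist-time-sorting approach being benchmarked can be sketched in plain Java. This is a simplification under stated assumptions, not the actual Druid patch: `Row` and the key ordering below are invented stand-ins for Druid's incremental-index types, but the shape mirrors the idea of keeping arrival order for ingestion and queries while producing a sorted view only when persisting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: rows are stored in arrival order; a time-then-dimension ordered
// copy is produced only at persist time, so query paths are untouched.
public class PersistSortSketch {
  static final class Row {
    final long timestamp;
    final String[] dims;

    Row(long timestamp, String... dims) {
      this.timestamp = timestamp;
      this.dims = dims;
    }
  }

  // Arrival-order storage, analogous in spirit to an unsorted FactsHolder.
  final List<Row> facts = new ArrayList<>();

  void add(Row row) {
    facts.add(row);
  }

  // Sorted iterable built only when persisting; this is the extra cost the
  // benchmarks above measure (roughly 15-20% slower persists).
  Iterable<Row> getPersistIterable() {
    List<Row> copy = new ArrayList<>(facts);
    copy.sort(
        Comparator.comparingLong((Row r) -> r.timestamp)
            .thenComparing(r -> String.join("\u0000", r.dims))
    );
    return copy;
  }

  public static void main(String[] args) {
    PersistSortSketch index = new PersistSortSketch();
    index.add(new Row(2000L, "b"));
    index.add(new Row(1000L, "z"));
    index.add(new Row(1000L, "a"));

    // Arrival order is preserved in the facts list...
    System.out.println(index.facts.get(0).timestamp); // 2000

    // ...while the persist iterable is time-then-dimension ordered.
    for (Row r : index.getPersistIterable()) {
      System.out.println(r.timestamp + " " + r.dims[0]);
    }
  }
}
```

The design tradeoff discussed in the thread is visible here: sorting a copy at persist time pays a one-time O(n log n) cost per persist instead of keeping a sorted structure that every ingest and query touches.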
[GitHub] asdf2014 commented on issue #6090: Fix missing exception handling as part of `io.druid.java.util.http.client.netty.HttpClientPipelineFactory`
asdf2014 commented on issue #6090: Fix missing exception handling as part of `io.druid.java.util.http.client.netty.HttpClientPipelineFactory` URL: https://github.com/apache/incubator-druid/pull/6090#issuecomment-410170672 BTW, when I tried to use `ExpectedException` instead of `Assert.assertTrue` in `JankyServersTest` for other test cases, I realized the `JankyServersTest#isChannelClosedException` situation was hard to convert to the `ExpectedException` style, so I created a subclass of `TypeSafeMatcher` to solve the problem.
```java
private static class CauseMatcher extends TypeSafeMatcher<Throwable>
{
  private final Class<? extends Throwable> expectedType;
  private final String expectedMessage;
  private final boolean isRegex;

  public CauseMatcher(Class<? extends Throwable> expectedType, String expectedMessage)
  {
    this(expectedType, expectedMessage, false);
  }

  public CauseMatcher(Class<? extends Throwable> expectedType, String expectedMessage, boolean isRegex)
  {
    this.expectedType = expectedType;
    this.expectedMessage = expectedMessage;
    this.isRegex = isRegex;
  }

  @Override
  protected boolean matchesSafely(Throwable item)
  {
    if (item == null || item.getMessage() == null) {
      return false;
    }
    if (!item.getClass().isAssignableFrom(expectedType)) {
      return false;
    }
    if (isRegex) {
      return Pattern.compile(expectedMessage).matcher(item.getMessage()).find();
    } else {
      return item.getMessage().contains(expectedMessage);
    }
  }

  @Override
  public void describeTo(Description description)
  {
    description.appendText("expects type is ")
               .appendValue(expectedType)
               .appendText(" and message is ")
               .appendValue(expectedMessage);
  }
}
```
Then the `JankyServersTest#isChannelClosedException` logic can be expressed with the following code.
```java
expectedException.expectCause(
    anyOf(
        new CauseMatcher(ChannelException.class, "Faulty channel in resource pool"),
        new CauseMatcher(IOException.class, ".*Connection reset by peer.*", true)
    )
);
```
However, this change adds quite a few new lines of code. It would be a lot of code just to solve this single case, but if it can serve as a common util class for other test cases, it might be worth adding. What do you think? @jihoonson
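The matching logic itself is plain Java and can be exercised without the Hamcrest wrapper. A minimal standalone sketch of the same type-plus-message check follows; note that it uses the more conventional `expectedType.isInstance(item)` direction rather than the `isAssignableFrom` call in the snippet above, and the class name is invented for illustration.

```java
import java.io.IOException;
import java.util.regex.Pattern;

// Standalone version of the cause-matching logic: a throwable matches when it
// is an instance of the expected type and its message matches either a
// literal substring or a regular expression.
public class CauseCheck {
  public static boolean matches(
      Throwable item,
      Class<? extends Throwable> expectedType,
      String expectedMessage,
      boolean isRegex
  ) {
    if (item == null || item.getMessage() == null) {
      return false;
    }
    if (!expectedType.isInstance(item)) {
      return false;
    }
    return isRegex
        ? Pattern.compile(expectedMessage).matcher(item.getMessage()).find()
        : item.getMessage().contains(expectedMessage);
  }

  public static void main(String[] args) {
    Throwable t = new IOException("Connection reset by peer: socket write error");
    System.out.println(matches(t, IOException.class, ".*Connection reset by peer.*", true));  // true
    System.out.println(matches(t, IllegalStateException.class, "reset", false));              // false
  }
}
```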
[GitHub] asdf2014 commented on issue #6090: Fix missing exception handling as part of `io.druid.java.util.http.client.netty.HttpClientPipelineFactory`
asdf2014 commented on issue #6090: Fix missing exception handling as part of `io.druid.java.util.http.client.netty.HttpClientPipelineFactory` URL: https://github.com/apache/incubator-druid/pull/6090#issuecomment-410165818 Hi, @jihoonson. After patching these changes, I found the same problem still exists in `ChannelResourceFactory`; you can run the `JankyServersTest#testHttpsEchoServer` test case to reproduce it. So, I added another anonymous subclass of `SimpleChannelUpstreamHandler` to fix it. PTAL.
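The general pattern at issue can be sketched in plain Java, independent of the Netty 3 API the actual patch uses: a pipeline of handlers where a terminal catch surfaces exceptions that would otherwise be silently lost. Everything below (class and interface names, messages) is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of a handler pipeline with an explicit exception sink. Without the
// terminal catch in fire(), an exception thrown inside a handler would never
// reach the caller -- which is the class of bug the PR fixes in Druid's
// Netty-based HTTP client pipeline.
public class PipelineSketch {
  interface Handler {
    void handle(String message) throws Exception;
  }

  private final List<Handler> handlers = new ArrayList<>();
  private final Consumer<Exception> exceptionHandler;

  PipelineSketch(Consumer<Exception> exceptionHandler) {
    this.exceptionHandler = exceptionHandler;
  }

  void addLast(Handler h) {
    handlers.add(h);
  }

  void fire(String message) {
    try {
      for (Handler h : handlers) {
        h.handle(message);
      }
    } catch (Exception e) {
      // Route the failure to a dedicated handler instead of swallowing it.
      exceptionHandler.accept(e);
    }
  }

  public static void main(String[] args) {
    List<String> seen = new ArrayList<>();
    PipelineSketch pipeline = new PipelineSketch(e -> seen.add(e.getMessage()));
    pipeline.addLast(msg -> {
      throw new IllegalStateException("Faulty channel in resource pool");
    });
    pipeline.fire("hello");
    System.out.println(seen); // [Faulty channel in resource pool]
  }
}
```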
[GitHub] clintropolis edited a comment on issue #6066: Sorting rows when rollup is disabled
clintropolis edited a comment on issue #6066: Sorting rows when rollup is disabled URL: https://github.com/apache/incubator-druid/issues/6066#issuecomment-410153475 I ran some additional benchmarks after realizing that the rows generated for the previous benchmarks had no opportunity for actual rollup to occur (all segments were approximately the same size for the numbers above). Here are timeseries benches with moderate rollup opportunity:
```
Benchmark                                        (numSegments)     (rollupSchema)  (rowsPerSegment)  (schemaAndQuery)  Mode  Cnt       Score       Error  Units
TimeseriesBenchmark.querySingleIncrementalIndex              1          no-rollup                75           basic.A  avgt   25  663840.128 ± 26363.127  us/op
TimeseriesBenchmark.querySingleIncrementalIndex              1  ordered-no-rollup                75           basic.A  avgt   25  679784.179 ± 81577.842  us/op
TimeseriesBenchmark.querySingleIncrementalIndex              1             rollup                75           basic.A  avgt   25   62446.589 ±  2224.296  us/op

no-rollup: size [22387432] bytes.
ordered-no-rollup: size [18195470] bytes.
rollup: size [2206430] bytes.
```
and heavy rollup potential:
```
Benchmark                                        (numSegments)     (rollupSchema)  (rowsPerSegment)  (schemaAndQuery)  Mode  Cnt       Score       Error  Units
TimeseriesBenchmark.querySingleIncrementalIndex              1          no-rollup                75           basic.A  avgt   25  653316.845 ± 31964.338  us/op
TimeseriesBenchmark.querySingleIncrementalIndex              1  ordered-no-rollup                75           basic.A  avgt   25  769623.711 ± 12299.182  us/op
TimeseriesBenchmark.querySingleIncrementalIndex              1             rollup                75           basic.A  avgt   25    6545.777 ±   607.087  us/op

no-rollup: size [22383561] bytes.
ordered-no-rollup: size [16900327] bytes.
rollup: size [237206] bytes.
```
and TopN, moderate rollup:
```
Benchmark                                   (numSegments)     (rollupSchema)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt       Score      Error  Units
TopNBenchmark.querySingleIncrementalIndex               1          no-rollup                75           basic.A           10  avgt   25  893805.325 ± 9592.710  us/op
TopNBenchmark.querySingleIncrementalIndex               1  ordered-no-rollup                75           basic.A           10  avgt   25  898036.822 ± 8052.554  us/op
TopNBenchmark.querySingleIncrementalIndex               1             rollup                75           basic.A           10  avgt   25   86100.936 ± 2844.073  us/op

no-rollup: size [22387432] bytes.
ordered-no-rollup: size [18195470] bytes.
rollup: size [2206430] bytes.
```
heavy rollup:
```
Benchmark                                   (numSegments)     (rollupSchema)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt       Score       Error  Units
TopNBenchmark.querySingleIncrementalIndex               1          no-rollup                75           basic.A           10  avgt   25  888967.034 ± 25098.293  us/op
TopNBenchmark.querySingleIncrementalIndex               1  ordered-no-rollup                75           basic.A           10  avgt   25  987568.305 ± 50955.718  us/op
TopNBenchmark.querySingleIncrementalIndex               1             rollup                75           basic.A           10  avgt   25    8820.929 ±   699.516  us/op

no-rollup: size [22383561] bytes.
ordered-no-rollup: size [16900327] bytes.
rollup: size [237206] bytes.
```
It would appear that the performance difference is more notable when the `Deque`s are deeper, at least for topN and timeseries, since the previous benchmarks were basically comparing flat maps with the same number of keys and single-element `Deque`s. Size savings will likely vary quite wildly based on dimension order, correlated with how effective rollup would be if it were enabled at the default millisecond granularity. In this case, with a few low cardinality dimensions and 1-10k events per timestamp, sizes were 20-25% smaller.
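The flat-map-versus-deep-`Deque` distinction discussed above can be sketched roughly as follows. This is a simplification invented for illustration, not Druid's actual facts types: the string key stands in for the incremental index's time-and-dimensions row key.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the facts layout: with rollup disabled, rows sharing the same
// (timestamp, dimensions) key pile up in a Deque under that key. "No rollup
// opportunity" means every key holds a single-element Deque; heavy rollup
// opportunity means few keys with deep Deques, which is where the benchmarks
// show a larger ordered-vs-unordered difference.
public class FactsSketch {
  final Map<String, Deque<long[]>> facts = new TreeMap<>();

  void add(String key, long[] row) {
    facts.computeIfAbsent(key, k -> new ArrayDeque<>()).add(row);
  }

  public static void main(String[] args) {
    FactsSketch sketch = new FactsSketch();
    // Heavy rollup opportunity: many rows collapse onto one key.
    for (int i = 0; i < 5; i++) {
      sketch.add("1000|dimA", new long[]{i});
    }
    sketch.add("2000|dimB", new long[]{99});
    System.out.println(sketch.facts.size());                  // 2 keys
    System.out.println(sketch.facts.get("1000|dimA").size()); // deque of depth 5
  }
}
```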