... or TSDB? We receive 1 mil meters x 288 readings = 288 mil rows (approx. 360 GB
per day). Therefore, we will end up with 10's or 100's of TBs of data, and I feel
that NoSQL will be much quicker than Hadoop/Spark. This is time series data that
is coming from many devices in the form of flat files, and it is currently
extracted / transformed / loaded into another database which is connected to BI
tools. We might use Azure Data Factory to collect the flat files and then use
Spark to ...
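Not part of the thread, but a hedged Scala sketch of the "use Spark to transform
the flat files" step described above; the file layout, the schema (a readingTime
column) and the paths are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().appName("meter-etl").getOrCreate()

// ~1M meters x 288 readings/day arrive as flat files (assumed CSV with header).
val readings = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///landing/meters/*.csv")

// Write one directory per day so downstream BI queries can prune by date.
readings
  .withColumn("day", to_date(col("readingTime")))
  .write
  .mode("append")
  .partitionBy("day")
  .parquet("hdfs:///warehouse/meter_readings")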
Cc: Prateek . ; user@spark.apache.org
Subject: Re: Spark job for Reading time series data from Cassandra
Hi,
the Spark Cassandra connector docs say
(https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md):
"The number of Spark partitions (tasks) created is directly controlled by
the setting spark.cassandra.input.split.size_in_mb. This number reflects
the approximate amount of Cassandra data in each Spark partition."
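Not from the thread, but a minimal Scala sketch of tuning that setting; the host
and the keyspace/table names from the question are assumptions, and a smaller
split size yields more (smaller) Spark partitions, i.e. more tasks:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-read")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // assumed host
  .set("spark.cassandra.input.split.size_in_mb", "32")   // smaller value => more splits/tasks
val sc = new SparkContext(conf)

val rows = sc.cassandraTable("iotdata", "coordinate")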
Prateek,
I believe that one task is created per Cassandra partition. How is your
data partitioned?
Regards,
Bryan Jeffrey
On Thu, Mar 10, 2016 at 10:36 AM, Prateek . wrote:
> Hi,
>
> I have a Spark Batch job for reading timeseries data from Cassandra which
> has 50,000 rows.
Hi,
I have a Spark Batch job for reading timeseries data from Cassandra which has
50,000 rows.
JavaRDD<String> cassandraRowsRDD = javaFunctions.cassandraTable("iotdata", "coordinate")
        .map(new Function<CassandraRow, String>() {
            @Override
            public String call(CassandraRow cassandraRow) throws Exception {
                // (placeholder body) return the row as a string
                return cassandraRow.toString();
            }
        });
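As a hedged follow-up (not from the thread), a Scala sketch of inspecting how
many partitions, and therefore tasks, the connector creates for this table;
the keyspace and table names reuse the ones from the question:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-check"))
val rdd = sc.cassandraTable("iotdata", "coordinate")
// One Spark task is launched per element of rdd.partitions.
println(s"Spark partitions for iotdata.coordinate: ${rdd.partitions.length}")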
Could you try this?
df.groupBy(((col("timeStamp") - start) / bucketLengthSec).cast(IntegerType))
  .agg(max("timeStamp"), max("value")).collect()
On Wed, Dec 9, 2015 at 8:54 AM, Arun Verma wrote:
> Hi all,
>
> We have RDD(main) of sorted time-series data. We
CC Sandy as his https://github.com/cloudera/spark-timeseries might be
of use here.
On Wed, Dec 9, 2015 at 4:54 PM, Arun Verma wrote:
> Hi all,
>
> We have RDD(main) of sorted time-series data. We want to split it into
> different RDDs according to window size and then perform some
Hi all,
We have RDD(main) of sorted time-series data. We want to split it into
different RDDs according to window size and then perform some aggregation
operation like max, min etc. over each RDD in parallel.
If window size is w, then the ith RDD has data from (startTime + (i-1)*w) to
(startTime + i*w).
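A hedged Scala sketch of the bucketing approach suggested above (not code from
the thread); the input path, the column names (timeStamp, value) and the start
and window values are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, min}
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().appName("window-agg").getOrCreate()
val df = spark.read.parquet("/data/timeseries")   // hypothetical input

val start = 1449612000L   // startTime as epoch seconds (assumed)
val w = 300L              // window size w in seconds (assumed)

// The ith bucket covers (startTime + (i-1)*w) to (startTime + i*w).
val windowed = df
  .withColumn("window", ((col("timeStamp") - start) / w).cast(IntegerType))
  .groupBy("window")
  .agg(max("value").as("maxValue"), min("value").as("minValue"))

windowed.show()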
Hi everyone!
I am working with multiple time series and, in summary, I have to adjust each
time series (like inserting average values in data gaps) and then train
regression models with MLlib for each time series. The adjustment step I did
with the adjustment function being mapped over each case (the ID being the key
and the features grouped by key). But for the regression models it was not
possible, because those functions need RDDs, and my idea would be to map each
element (grouped as a time series) to a training function. How can I deal with
time series data in ...
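A hedged sketch of one common workaround (not the poster's code): since an
MLlib estimator cannot be invoked inside a transformation over another RDD,
group the records by series ID and fit a small model per group in plain Scala;
the SeriesPoint type and the 1-D least-squares fit are illustrative
assumptions:

import org.apache.spark.rdd.RDD

case class SeriesPoint(id: String, t: Double, y: Double)

// Ordinary least squares for y = a + b*t over one (small) series.
def fitLeastSquares(points: Seq[SeriesPoint]): (Double, Double) = {
  val n = points.size.toDouble
  val (st, sy) = (points.map(_.t).sum, points.map(_.y).sum)
  val stt = points.map(p => p.t * p.t).sum
  val sty = points.map(p => p.t * p.y).sum
  val b = (n * sty - st * sy) / (n * stt - st * st)
  val a = (sy - b * st) / n
  (a, b)
}

// One (intercept, slope) pair per time series, trained in parallel across keys.
def trainPerSeries(data: RDD[SeriesPoint]): RDD[(String, (Double, Double))] =
  data.groupBy(_.id).mapValues(pts => fitLeastSquares(pts.toSeq.sortBy(_.t)))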
Each record has a date/time dimension and I want to write data within the same
time dimension to the same HDFS directory. The data stream might be unordered
(by time dimension).

I'm wondering what are the best practices in grouping/storing a time series
data stream using Spark Streaming?

I'm considering grouping each batch of data in Spark Streaming per time
dimension and then saving each group to a different HDFS directory. However,
since it is possible for data with the ...
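A hedged Scala sketch (not from the thread) of the "group each batch per time
dimension, one HDFS directory per group" idea, letting partitionBy create one
date=... directory per day under a single output path; the Event fields and the
output path are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_unixtime, to_date}
import org.apache.spark.streaming.dstream.DStream

case class Event(deviceId: String, epochSec: Long, value: Double)

def writeByDate(spark: SparkSession, events: DStream[Event], outputPath: String): Unit = {
  import spark.implicits._
  events.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.toDF()
        .withColumn("date", to_date(from_unixtime(col("epochSec"))))
        .write
        .mode("append")          // late, out-of-order batches append into the right date directory
        .partitionBy("date")
        .parquet(outputPath)
    }
  }
}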
... name to decide which partition it goes into. You'd need to
make corresponding changes for HadoopPartition & the compute() method.

(or if you can't subclass HadoopRDD directly you can use it for
inspiration.)

On Mon, Mar 9, 2015 at 11:18 AM, Shuai Zheng wrote:
> Hi All,
>
> If I have a set of time series data files, they are in parquet format and
> the data for each day are stored following a naming convention, but I will
> not know how many files for one day.
Hi All,
If I have a set of time series data files, they are in parquet format and
the data for each day are stored following a naming convention, but I will not
know how many files for one day.
20150101a.parq
20150101b.parq
20150102a.parq
20150102b.parq
20150102c.parq
.
201501010a.parq
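A hedged Scala sketch (not from the thread): instead of subclassing HadoopRDD,
a simpler route is to let Spark glob the per-day file names, since the day is
encoded in the prefix; the base path is an assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("daily-parquet").getOrCreate()

// All files for 2015-01-02, however many suffixes (a, b, c, ...) exist.
val day = "20150102"
val daily = spark.read.parquet(s"hdfs:///data/ts/$day*.parq")
println(daily.count())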
What are you trying to do? Generate time series from your data in HDFS, or do
some transformation and/or aggregation on your time series data in HDFS?
I have a use case for our data in HDFS that involves sorting chunks of data
into time series format by a specific characteristic and doing computations
from that. At large scale, what is the most efficient way to do this?
Obviously, having the data sharded by that characteristic would make the ...
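A hedged Scala sketch (not from the thread) of the sharding idea: partition by
the characteristic and sort by time within each shard in a single shuffle, so
every partition holds whole, time-ordered series; the Reading type and its
field names are assumptions:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

case class Reading(series: String, ts: Long, value: Double)

// Partition on the series only, even though the key also carries the timestamp.
class SeriesPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (series: String, _) => math.abs(series.hashCode % numPartitions)
  }
}

def toSortedSeries(input: RDD[Reading], shards: Int): RDD[((String, Long), Double)] =
  input
    .map(r => ((r.series, r.ts), r.value))
    .repartitionAndSortWithinPartitions(new SeriesPartitioner(shards))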
Hi all,
I have a table containing historical time series data. I know the logging
frequency for the same. Is there any way to write UDFs to count the total
number of missing data in Spark?
I am new to Spark, and this question might be naive. But a piece of
code/resource might help me jump-start.
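A hedged Scala sketch (not from the thread): with a known logging frequency,
the missing count per series can be derived as expected minus observed, where
expected = (max(ts) - min(ts)) / frequency + 1; the table and column names and
the frequency value are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, max, min}

val spark = SparkSession.builder().appName("missing-count").getOrCreate()
val freqSec = 300L   // assumed logging frequency: one reading every 5 minutes

val history = spark.table("history")   // assumed columns: series_id, ts (epoch seconds), value

val missing = history
  .groupBy("series_id")
  .agg(
    ((max(col("ts")) - min(col("ts"))) / freqSec + 1).as("expected"),
    countDistinct(col("ts")).as("observed"))
  .withColumn("missing", col("expected") - col("observed"))

missing.show()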