Cascading Spark Structured Streams

2017-12-28 Thread Eric Dain
I need to write a Spark Structured Streaming pipeline that involves
multiple aggregations, splitting data into multiple sub-pipes, and unioning
them. It also needs stateful aggregation with a timeout.

Spark Structured Streaming supports all of the required functionality, but
not as one stream. I did a proof of concept that divides the pipeline into 3
sub-streams cascaded using Kafka, and it seems to work. But I was wondering
if it would be a good idea to skip Kafka and use HDFS files as the
integration layer, or whether there is another way to cascade streams
without needing an extra service like Kafka.
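For reference, here is a minimal sketch of the HDFS-file-based cascading I
have in mind, assuming Spark 2.2+ with the Kafka source on the classpath;
the topic, paths and schema below are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("cascade-poc").getOrCreate()

// Sub-stream 1: read from the upstream source and write its intermediate
// results to an HDFS directory as Parquet.
val stage1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "input-topic")                // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

stage1.writeStream
  .format("parquet")
  .option("path", "hdfs:///pipeline/stage1-out")          // placeholder
  .option("checkpointLocation", "hdfs:///pipeline/chk1")  // placeholder
  .start()

// Sub-stream 2: treat that directory as a streaming file source (file
// sources need an explicit schema) and continue the pipeline from there.
val intermediateSchema = new StructType().add("value", StringType)
val stage2 = spark.readStream
  .schema(intermediateSchema)
  .parquet("hdfs:///pipeline/stage1-out")
// ... further aggregations / unions on stage2, written with its own sink.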

Thanks,


Re: Standalone Cluster: ClassNotFound org.apache.kafka.common.serialization.ByteArrayDeserializer

2017-12-28 Thread Shixiong(Ryan) Zhu
The cluster mode doesn't upload jars to the driver node. This is a known
issue: https://issues.apache.org/jira/browse/SPARK-4160
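Until that is fixed, one common workaround is to bundle the Kafka source and
its kafka-clients dependency into the application jar itself. A minimal
build.sbt sketch, assuming sbt with the sbt-assembly plugin (versions are
illustrative, not a recommendation):

name := "my-streaming-app"   // placeholder
scalaVersion := "2.11.11"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0",  // brings in kafka-clients
  "org.apache.kafka" %  "kafka-clients"        % "0.10.0.1"
)

// Keep the META-INF/services registrations: discarding all of META-INF is a
// common cause of "Failed to find data source: kafka" from an uber jar.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case _                                         => MergeStrategy.first
}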

On Wed, Dec 27, 2017 at 1:27 AM, Geoff Von Allmen wrote:

> I’ve tried it both ways.
>
> The uber jar gives me the following:
>
>- Caused by: java.lang.ClassNotFoundException: Failed to find data
>  source: kafka. Please find packages at
>  http://spark.apache.org/third-party-projects.html
>
> If I only do minimal packaging and add
> org.apache.spark_spark-sql-kafka-0-10_2.11-2.2.0.jar as a --package and
> then add it to the --driver-class-path, then I get past that error, but I
> get the error I showed in the original post.
>
> I agree it seems it’s missing the kafka-clients jar file as that is where
> the ByteArrayDeserializer is, though it looks like it’s present as far as
> I can tell.
>
> I can see the following two packages in the ClassPath entries on the
> history server (Though the source shows: **(redacted) — not sure
> why?)
>
>- spark://:/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar
>- spark://:/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.2.0.jar
>
> As a side note, I'm running both a master and worker on the same system
> just to test out running in cluster mode. Not sure if that would have
> anything to do with it. I would think it would make it easier since it's
> got access to all the same file system... but I'm pretty new to Spark.
>
> I have also read through and followed those instructions as well as many
> others at this point.
>
> Thanks!
>
> On Wed, Dec 27, 2017 at 12:56 AM, Eyal Zituny wrote:
>
>> Hi,
>> It seems that you're missing the kafka-clients jar (and probably some
>> other dependencies as well).
>> How did you package your application jar? Does it include all the
>> required dependencies (as an uber jar)?
>> If it's not an uber jar, you need to pass via the driver-class-path and
>> the executor-class-path all the files/dirs where your dependencies can be
>> found (note that those must be accessible from each node in the cluster).
>> I suggest going over the manual.
>>
>> Eyal
>>
>>
>> On Wed, Dec 27, 2017 at 1:08 AM, Geoff Von Allmen wrote:
>>
>>> I am trying to deploy a standalone cluster but running into
>>> ClassNotFound errors.
>>>
>>> I have tried a myriad of different approaches, ranging from packaging
>>> all dependencies into a single JAR to using the --packages
>>> and --driver-class-path options.
>>>
>>> I've got a master node started, a slave node running on the same system,
>>> and am using spark-submit to get the streaming job kicked off.
>>>
>>> Here is the error I’m getting:
>>>
>>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at 
>>> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>>> at 
>>> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>>> Caused by: java.lang.NoClassDefFoundError: 
>>> org/apache/kafka/common/serialization/ByteArrayDeserializer
>>> at 
>>> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:376)
>>> at 
>>> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
>>> at 
>>> org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:323)
>>> at 
>>> org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
>>> at 
>>> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:198)
>>> at 
>>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
>>> at 
>>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
>>> at 
>>> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>>> at 
>>> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
>>> at com.Customer.start(Customer.scala:47)
>>> at com.Main$.main(Main.scala:23)
>>> at com.Main.main(Main.scala)
>>> ... 6 more
>>> Caused by: java.lang.ClassNotFoundException: 
>>> org.apache.kafka.common.serialization.ByteArrayDeserializer
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> ... 18 more

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Gourav Sengupta
Hi Jeroen,

Can you try using EMR version 5.10 or 5.11 instead?
Can you please try selecting a subnet which is in a different availability
zone?
If possible, try increasing the number of task instances and see whether it
makes a difference.
Also, in case you are using caching, check the total amount of space being
used; in the worst case you may want to persist intermediate data to S3 in
the default Parquet format and then work through the steps that you think
are failing using a Jupyter or Spark notebook.
Also, can you please report the number of containers that your job is
creating by looking at the metrics in the EMR console?
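For the persist-to-S3 suggestion, a minimal sketch of what I mean (the
bucket, paths and the aggregation itself are placeholders for your own job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("checkpoint-to-s3").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("s3://my-bucket/input/")  // placeholder input

// Materialise the intermediate result of the suspect step to S3 as Parquet...
val intermediate = df.groupBy($"key").agg(sum($"value").as("total"))
intermediate.write.mode("overwrite").parquet("s3://my-bucket/intermediate/step1")

// ...then resume from it (e.g. in a notebook) to isolate the failing step.
val resumed = spark.read.parquet("s3://my-bucket/intermediate/step1")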

Also, in the Spark UI you can easily see which particular step is taking
the longest; you just have to drill in a bit in order to see that. In case
shuffling is an issue, it generally shows up clearly in the Spark UI when I
drill into the stages and see which particular one is taking the longest.


Since you do not have a long-running cluster (which I mistook from your
statement about a long-running job), things should be fine.


Regards,
Gourav Sengupta


On Thu, Dec 28, 2017 at 7:43 PM, Jeroen Miller wrote:

> On 28 Dec 2017, at 19:42, Gourav Sengupta wrote:
> > In the EMR cluster what are the other applications that you have enabled
> (like HIVE, FLUME, Livy, etc).
>
> Nothing that I can think of, just a Spark step (unless EMR is doing fancy
> stuff behind my back).
>
> > Are you using SPARK Session?
>
> Yes.
>
> > If yes is your application using cluster mode or client mode?
>
> Cluster mode.
>
> > Have you read the EC2 service level agreement?
>
> I did not -- I doubt it has the answer to my problem though! :-)
>
> > Is your cluster on auto scaling group?
>
> Nope.
>
> > Are you scheduling your job by adding another new step into the EMR
> cluster? Or is it the same job running always triggered by some background
> process?
> > Since EMR are supposed to be ephemeral, have you tried creating a new
> cluster and trying your job in that?
>
> I'm creating a new cluster on demand, specifically for that job. No other
> application runs on it.
>
> JM
>
>


Re: Custom Data Source for getting data from Rest based services

2017-12-28 Thread vaish02
We extensively use PubMed and clinical-trial databases for our work, which
involves making a large number of parametric REST API queries. Usually, if
the data download is large, the requests get timed out and we have to run
queries in very small batches. We also make extensive use of a large number
(thousands) of NLP queries for our ML work.
  Given that our content is quite large and we are constrained by the
public database interfaces, such a framework would be very beneficial for
our use case. Since I just stumbled on this post, I will try to use this
package in the context of our framework and let you know the difference
between using the library and the way we do it conventionally. Thanks for
sharing it with the community.







Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:42, Gourav Sengupta  wrote:
> In the EMR cluster what are the other applications that you have enabled 
> (like HIVE, FLUME, Livy, etc).

Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff 
behind my back).

> Are you using SPARK Session?

Yes.

> If yes is your application using cluster mode or client mode?

Cluster mode.

> Have you read the EC2 service level agreement?

I did not -- I doubt it has the answer to my problem though! :-)

> Is your cluster on auto scaling group?

Nope.

> Are you scheduling your job by adding another new step into the EMR cluster? 
> Or is it the same job running always triggered by some background process?
> Since EMR are supposed to be ephemeral, have you tried creating a new cluster 
> and trying your job in that?

I'm creating a new cluster on demand, specifically for that job. No other 
application runs on it.

JM





Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:40, Maximiliano Felice  
wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of 
> a mix of speculative execution and OOM issues in the container.

Interesting! However I don't have any OOM exception in the logs. Does that rule 
out your hypothesis?

> We've managed to check that when we have speculative execution enabled and 
> some YARN containers which were running speculative tasks died, they did use 
> up one of the max-attempts. This wouldn't represent any issue in normal 
> behavior, but it seems that if all the retries were consumed in a task that 
> has started speculative execution, the application itself doesn't fail; it 
> hangs the task, expecting to reschedule it at some point. As the remaining 
> attempts are zero, it never reschedules it and the application never manages 
> to finish.

Hmm, this sounds like a huge design fail to me, but I'm sure there are very 
complicated issues that go way over my head.

> 1. Check the number of tasks scheduled. If you see one (or more) tasks 
> missing when you do the final sum, then you might be encountering this issue.
> 2. Check the container logs to see if anything broke. OOM is what failed 
> for me.

I can't find anything in the logs from EMR. Should I expect to find explicit 
OOM exception messages? 

JM





Fwd: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:25, Patrick Alwell  wrote:
> You are using groupByKey(); have you thought of an alternative like 
> aggregateByKey() or combineByKey() to reduce shuffling?

I am aware of this indeed. I do have a groupByKey() that is difficult to avoid, 
but the problem occurs afterwards.

> Dynamic allocation is great; but sometimes I’ve found explicitly setting the 
> num executors, cores per executor, and memory per executor to be a better 
> alternative.

I will try with dynamic allocation off.

> Take a look at the yarn logs as well for the particular executor in question. 
> Executors can have multiple tasks; and will often fail if they have more 
> tasks than available threads.

The trouble is there is nothing significant in the logs (read: nothing that is 
clear enough for me to understand!). Any special message I could grep for?

> [...] https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
> [...] https://spark.apache.org/docs/latest/hardware-provisioning.html

Thanks for the pointers -- will have a look!

Jeroen






Re: Spark on EMR suddenly stalling

2017-12-28 Thread Gourav Sengupta
Hi Jeroen,

Can I get a few pieces of additional information please?

In the EMR cluster what are the other applications that you have enabled
(like HIVE, FLUME, Livy, etc).
Are you using SPARK Session? If yes is your application using cluster mode
or client mode?
Have you read the EC2 service level agreement?
Is your cluster on auto scaling group?
Are you scheduling your job by adding another new step into the EMR
cluster? Or is it the same job running always triggered by some background
process?
Since EMR are supposed to be ephemeral, have you tried creating a new
cluster and trying your job in that?


Regards,
Gourav Sengupta

On Thu, Dec 28, 2017 at 4:06 PM, Jeroen Miller wrote:

> Dear Sparkers,
>
> Once again in times of desperation, I leave what remains of my mental
> sanity to this wise and knowledgeable community.
>
> I have a Spark job (on EMR 5.8.0) which had been running daily for months,
> if not the whole year, with absolutely no supervision. This changed all of a
> sudden for reasons I do not understand.
>
> The volume of data processed daily has been slowly increasing over the
> past year but has been stable in the last couple of months. Since I'm only
> processing the past 8 days' worth of data, I do not think that increased
> data volume is to blame here. Yes, I did check the volume of data for the
> past few days.
>
> Here is a short description of the issue.
>
> - The Spark job starts normally and proceeds successfully with the first
> few stages.
> - Once we reach the dreaded stage, all tasks are performed successfully
> (they typically take not more than 1 minute each), except for the /very/
> first one (task 0.0) which never finishes.
>
> Here is what the log looks like (simplified for readability):
>
> 
> INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412
> ms on ... (executor 12) (254/256)
> INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394
> ms on ... (executor 7) (255/256)
> INFO ExecutorAllocationManager: Request to remove executorIds: 14
> INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
> INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed
> is 14
> INFO YarnAllocator: Driver requested a total number of 0 executor(s).
> 
>
> Why is that? There is still a task waiting to be completed right? Isn't an
> executor needed for that?
>
> Afterwards, all executors are getting killed (dynamic allocation is turned
> on):
>
> 
> INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
> INFO ExecutorAllocationManager: Removing executor 14 because it has been
> idle for 60 seconds (new desired total will be 5)
> .
> .
> .
> INFO ExecutorAllocationManager: Request to remove executorIds: 7
> INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
> INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed
> is 7
> INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
> INFO ExecutorAllocationManager: Removing executor 7 because it has been
> idle for 60 seconds (new desired total will be 1)
> INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
> INFO DAGScheduler: Executor lost: 7 (epoch 4)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from
> BlockManagerMaster.
> INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7,
> ..., 44289, None)
> INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
> INFO ExecutorAllocationManager: Existing executor 7 has been removed (new
> total is 1)
> 
>
> Then, there's nothing more in the driver's log. Nothing. The cluster then
> runs for hours, with no progress being made, and no executors allocated.
>
> Here is what I tried:
>
> - More memory per executor: from 13 GB to 24 GB by increments.
> - Explicit repartition() on the RDD: from 128 to 256 partitions.
>
> The offending stage used to be a rather innocent looking keyBy(). After
> adding some repartition() the offending stage was then a mapToPair().
> During my last experiments, it turned out the repartition(256) itself is
> now the culprit.
>
> I like Spark, but its mysteries will manage to send me to a mental
> hospital one of these days.
>
> Can anyone shed light on what is going on here, or maybe offer some
> suggestions or pointers to relevant sources of information?
>
> I am completely clueless.
>
> Season's greetings,
>
> Jeroen
>
>
>
>


Re: Spark on EMR suddenly stalling

2017-12-28 Thread Maximiliano Felice
Hi Jeroen,

I experienced a similar issue a few weeks ago. The situation was a result
of a mix of speculative execution and OOM issues in the container.

First of all, when an executor takes too much time in Spark, it is handled
by YARN speculative execution, which will launch a new executor and
allocate it in a new container. In our case, some tasks were throwing OOM
exceptions while executing, not on the executor itself *but on the
YARN container.*

It turns out that YARN will try several times to run an application when
something fails in it. Specifically, it will try
*yarn.resourcemanager.am.max-attempts* times to run the application before
failing, which has a default value of 2 and is not modified in the EMR
configurations (check the EMR configuration documentation).
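If you want to rule this mechanism out from the Spark side, here is a
minimal sketch of the knobs involved (the values are illustrative, not
recommendations, and spark.yarn.maxAppAttempts is still capped by the YARN
setting above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emr-job")                        // placeholder
  .config("spark.speculation", "false")      // rule out speculative retries
  .config("spark.yarn.maxAppAttempts", "1")  // fail fast instead of re-attempting
  .getOrCreate()

// Confirm what the cluster actually applied at runtime:
spark.conf.getAll
  .filter { case (k, _) => k.contains("speculation") || k.contains("maxAppAttempts") }
  .foreach(println)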

We've managed to check that when we have speculative execution enabled and
some YARN containers which were running speculative tasks died, they did
use up one of the *max-attempts*. This wouldn't represent any issue in
normal behavior, but it seems that if all the retries were consumed in a
task that has started speculative execution, the application itself doesn't
fail; it hangs the task, expecting to reschedule it at some point. As the
remaining attempts are zero, it never reschedules it and the application
never manages to finish.

I checked this theory repeatedly, always getting the expected results.
Several times I changed the aforementioned YARN configuration, and the job
always starts speculative retries on this task and hangs once the number of
broken YARN containers reaches *max-attempts*.

I personally think that this issue should also be reproducible without
speculative execution configured.

So, what would I do if I were you?

1. Check the number of tasks scheduled. If you see one (or more) tasks
missing when you do the final sum, then you might be encountering this
issue.
2. Check the *container* logs to see if anything broke. OOM is what failed
for me.
3. Contact AWS EMR support, although in my experience they were of no help
at all.


Hope this helps you a bit!



2017-12-28 14:57 GMT-03:00 Jeroen Miller :

> On 28 Dec 2017, at 17:41, Richard Qiao  wrote:
> > Are you able to specify which path of data filled up?
>
> I can narrow it down to a bunch of files but it's not so straightforward.
>
> > Any logs not rolled over?
>
> I have to manually terminate the cluster but there is nothing more in the
> driver's log when I check it from the AWS console when the cluster is still
> running.
>
> JM
>
>
>
>


Re: Spark on EMR suddenly stalling

2017-12-28 Thread Patrick Alwell
Jeroen,

Anytime there is a shuffle over the network, Spark moves to a new stage. It seems 
like you are having issues either pre- or post-shuffle. Have you looked at a 
resource management tool like Ganglia to determine whether this is a memory- or 
thread-related issue? The Spark UI?

You are using groupByKey(); have you thought of an alternative like 
aggregateByKey() or combineByKey() to reduce shuffling?
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoid_groupbykey_when_performing_an_associative_re/avoid-groupbykey-when-performing-a-group-of-multiple-items-by-key.html
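For illustration, a minimal sketch of that swap for a per-key sum (assuming
the grouped values are ultimately reduced to an aggregate; the data is made
up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("agg-example").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// groupByKey() ships every value across the network before summing:
val grouped = pairs.groupByKey().mapValues(_.sum)

// aggregateByKey() combines map-side first, shuffling only partial sums:
val aggregated = pairs.aggregateByKey(0)(_ + _, _ + _)

aggregated.collect().foreach(println)  // (a,3) and (b,3), in some order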

Dynamic allocation is great, but sometimes I've found explicitly setting the 
number of executors, cores per executor, and memory per executor to be a better 
alternative.
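For example, something along these lines (the numbers are placeholders to be
sized for your cluster; the same settings can also be passed as spark-submit
flags):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fixed-resources-job")                      // placeholder
  .config("spark.dynamicAllocation.enabled", "false")  // pin resources explicitly
  .config("spark.executor.instances", "10")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "8g")
  .getOrCreate()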

Take a look at the yarn logs as well for the particular executor in question. 
Executors can have multiple tasks; and will often fail if they have more tasks 
than available threads.

As for partitioning the data: you could also look into your level of 
parallelism, which is correlated to the splittability (blocks) of the data. This 
will be based on your largest RDD.
https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism

Spark is like C/C++: you need to manage the memory buffer or the compiler will 
throw you out  ;)
https://spark.apache.org/docs/latest/hardware-provisioning.html

Hang in there; this is the more complicated stage of putting a Spark 
application into production. The YARN logs should point you in the right 
direction.

It’s tough to debug over email, so hopefully this information is helpful.

-Pat


On 12/28/17, 9:57 AM, "Jeroen Miller"  wrote:

On 28 Dec 2017, at 17:41, Richard Qiao  wrote:
> Are you able to specify which path of data filled up?

I can narrow it down to a bunch of files but it's not so straightforward.

> Any logs not rolled over?

I have to manually terminate the cluster but there is nothing more in the 
driver's log when I check it from the AWS console when the cluster is still 
running. 

JM









Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 17:41, Richard Qiao  wrote:
> Are you able to specify which path of data filled up?

I can narrow it down to a bunch of files but it's not so straightforward.

> Any logs not rolled over?

I have to manually terminate the cluster but there is nothing more in the 
driver's log when I check it from the AWS console when the cluster is still 
running. 

JM





Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
Dear Sparkers,

Once again in times of desperation, I leave what remains of my mental sanity to 
this wise and knowledgeable community.

I have a Spark job (on EMR 5.8.0) which had been running daily for months, if 
not the whole year, with absolutely no supervision. This changed all of a sudden 
for reasons I do not understand.

The volume of data processed daily has been slowly increasing over the past 
year but has been stable in the last couple of months. Since I'm only processing 
the past 8 days' worth of data, I do not think that increased data volume is to 
blame here. Yes, I did check the volume of data for the past few days.

Here is a short description of the issue.

- The Spark job starts normally and proceeds successfully with the first few 
stages.
- Once we reach the dreaded stage, all tasks are performed successfully (they 
typically take not more than 1 minute each), except for the /very/ first one 
(task 0.0) which never finishes.

Here is what the log looks like (simplified for readability):


INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412 ms on 
... (executor 12) (254/256)
INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394 ms on 
... (executor 7) (255/256)
INFO ExecutorAllocationManager: Request to remove executorIds: 14
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 14
INFO YarnAllocator: Driver requested a total number of 0 executor(s).


Why is that? There is still a task waiting to be completed right? Isn't an 
executor needed for that?

Afterwards, all executors are getting killed (dynamic allocation is turned on):


INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
INFO ExecutorAllocationManager: Removing executor 14 because it has been idle 
for 60 seconds (new desired total will be 5)
.
.
.
INFO ExecutorAllocationManager: Request to remove executorIds: 7
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 7
INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
INFO ExecutorAllocationManager: Removing executor 7 because it has been idle 
for 60 seconds (new desired total will be 1)
INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
INFO DAGScheduler: Executor lost: 7 (epoch 4)
INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from 
BlockManagerMaster.
INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, ..., 
44289, None)
INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
INFO ExecutorAllocationManager: Existing executor 7 has been removed (new total 
is 1)


Then, there's nothing more in the driver's log. Nothing. The cluster then runs 
for hours, with no progress being made, and no executors allocated.

Here is what I tried:

- More memory per executor: from 13 GB to 24 GB by increments.
- Explicit repartition() on the RDD: from 128 to 256 partitions.

The offending stage used to be a rather innocent looking keyBy(). After adding 
some repartition() the offending stage was then a mapToPair(). During my last 
experiments, it turned out the repartition(256) itself is now the culprit.
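For reference, a hypothetical Scala sketch of the shape of that part of the
pipeline (the input path and key function are made up, and mapToPair() from
the Java API corresponds to the plain pair-producing map here); the
repartition() forces a shuffle, which is why it shows up as its own stage:

import org.apache.spark.sql.SparkSession

val spark   = SparkSession.builder.appName("emr-job").getOrCreate()
val records = spark.sparkContext.textFile("hdfs:///input/")  // placeholder

val keyed         = records.keyBy(line => line.take(8))      // placeholder key function
val repartitioned = keyed.repartition(256)                   // the stage that now stalls
val mapped        = repartitioned.map { case (k, v) => (k, v.length) }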

I like Spark, but its mysteries will manage to send me to a mental hospital one 
of these days.

Can anyone shed light on what is going on here, or maybe offer some suggestions 
or pointers to relevant sources of information?

I am completely clueless.

Season's greetings,

Jeroen





Pyspark and searching items from data structures

2017-12-28 Thread Esa Heikkinen
Hi

I would like to build a PySpark application which searches for sequential items 
or events of a time series from CSV files.

What are the best data structures for this purpose? A PySpark or pandas 
DataFrame, an RDD, SQL, or something else?

---
Esa


Re: Reading data from OpenTSDB or KairosDB

2017-12-28 Thread marko

Hello,

Thanks for your answer.

And what do you think about the approach of querying data from 
OpenTSDB/KairosDB piece by piece, creating a DataFrame for each piece, 
and then making a union out of them?


This would enable us to store and query the data as time series and still 
process it using Spark, right?
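Roughly, what I have in mind is something like the following sketch,
assuming a hypothetical fetchChunk() helper that queries the TSDB REST API
for one time window and returns the points as a local collection (the
windowing and the Point schema are made up):

import org.apache.spark.sql.{DataFrame, SparkSession}

case class Point(metric: String, timestamp: Long, value: Double)

val spark = SparkSession.builder.appName("tsdb-union").getOrCreate()
import spark.implicits._

// Hypothetical REST call against OpenTSDB/KairosDB for one window:
def fetchChunk(startMs: Long, endMs: Long): Seq[Point] = ???

val dayMs   = 24L * 3600 * 1000
val windows = (0 until 30).map(i => (i * dayMs, (i + 1) * dayMs))  // illustrative

val frames: Seq[DataFrame] = windows.map { case (s, e) => fetchChunk(s, e).toDF() }
val all: DataFrame = frames.reduce(_ union _)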


Best regards,
Marko


On 12/21/2017 12:48 PM, Jörn Franke wrote:

There are data sources for Cassandra and HBase; however, I am not sure how useful 
they are, because then you also need to implement the logic of OpenTSDB or 
KairosDB.

Better to implement your own data sources.

Then, there are several projects enabling time-series queries in Spark, but I am 
not sure how much they are used. Additionally, time series analysis may imply a 
lot of different functionality (e.g. matching timestamps), which they may 
not have.


On 21. Dec 2017, at 12:27, marko  wrote:

Hello everyone,

I would like to know whether there is a way to read time-series data from 
OpenTSDB (built on top of HBase) or KairosDB (built on top of Cassandra) into a 
Spark DataFrame (or RDDs)?

Best regards,
Marko





