Hello all,
How do you log what is happening inside your Spark DataFrame pipelines?
I would like to collect statistics along the way, mostly counts of rows at
particular steps, to see where rows were filtered and so on. Is there any
other way to do this than calling .count on the DataFrame?
R
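A minimal sketch of the count-based approach, for reference (rawDf and the
"amount" column are made-up names; note that each logCount call triggers
its own Spark job, so cache the DataFrame first if recomputation matters):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: logs the row count at a named step and returns the
// DataFrame unchanged, so it chains cleanly with .transform().
def logCount(step: String)(df: DataFrame): DataFrame = {
  println(s"[$step] rows = ${df.count()}")
  df
}

val result = rawDf                       // rawDf: hypothetical input DataFrame
  .transform(logCount("raw"))
  .filter(col("amount") > 0)             // "amount" is a made-up column
  .transform(logCount("after filter"))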
I can *imagine* writing some sort of DataframeReader-generation tool, but
am not aware of one that currently exists.
On Tue, Apr 2, 2019 at 13:08 Surendra , Manchikanti <
surendra.manchika...@gmail.com> wrote:
>
> Looking for a generic solution, not for a specific DB or number of tables.
Looking for a generic solution, not for a specific DB or number of tables.
On Fri, Mar 29, 2019 at 5:04 AM Jason Nerothin wrote:
> How many tables? What DB?
>
> On Fri, Mar 29, 2019 at 00:50 Surendra , Manchikanti <
> surendra.manchika...@gmail.com> wrote:
>
>> Hi Jason,
>>
>> Thanks for your reply.
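For what it's worth, a rough sketch of a DB-agnostic loop over JDBC (the
URL, credentials, table names, and output paths below are all placeholders;
any database with a JDBC driver on the classpath should work the same way):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("generic-ingest").getOrCreate()

val tables = Seq("customers", "orders", "line_items") // placeholder list

tables.foreach { t =>
  spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db") // placeholder URL
    .option("dbtable", t)
    .option("user", "user")                          // placeholder creds
    .option("password", "secret")
    .load()
    .write
    .mode("overwrite")
    .parquet(s"hdfs:///raw/$t")                      // placeholder path
}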
To add more info, this project is on an older version of Spark, 1.5.0, and
an older version of Kafka, 0.8.2.1 (2.10-0.8.2.1).
On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg wrote:
> Hi,
>
> I've got 3 questions/issues regarding checkpointing and was hoping someone
> could help shed some light on this.
Hi,
I've got 3 questions/issues regarding checkpointing and was hoping someone
could help shed some light on this.
We've got a Spark Streaming consumer consuming data from a Kafka topic; it
generally works fine until I switch it to checkpointing mode by calling the
'checkpoint' method on the context.
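For reference, on Spark 1.5.0 the documented recovery pattern is to build
the whole streaming graph inside a factory function passed to
StreamingContext.getOrCreate; a rough sketch (checkpoint path and batch
interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// All stream setup must happen inside the factory: on restart, getOrCreate
// rebuilds the context from the checkpoint and never calls the factory.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-consumer")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/kafka-consumer")
  // ... KafkaUtils.createDirectStream(...) and transformations go here ...
  ssc
}

val ssc = StreamingContext.getOrCreate(
  "hdfs:///checkpoints/kafka-consumer", createContext _)
ssc.start()
ssc.awaitTermination()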
I am still struggling with getting fit() to work on my dataset.
The Spark ML exception at issue is:
LAPACK.dppsv returned 6 because A is not positive definite. Is A derived from a
singular matrix (e.g. collinear column values)?
Comparing my standardized Weight values with the tutorial's
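Assuming the model being fit is LinearRegression with the default
"normal"/"auto" solver (an assumption on my part), two common workarounds
are a small L2 penalty, which makes the normal-equation matrix positive
definite even with collinear columns, or the iterative solver, which skips
the Cholesky (dppsv) path entirely:

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setRegParam(0.01)    // small ridge penalty; tune for your data
  .setSolver("l-bfgs")  // avoids the dppsv solve altogether

val model = lr.fit(trainingData) // trainingData: features/label DataFrame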
Hello,
I want to ask if there is any way to measure HDFS data loading time at the
start of my program. I tried to add an action, e.g. count(), after the
val data = sc.textFile() call, but I noticed that my program takes more time
to finish than before adding the count() call. Is there any other way to do it?
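One rough way to isolate the load time is to time the count() itself and
cache the RDD, so later stages reuse the in-memory data instead of
re-reading from HDFS (which is likely why the program got slower overall);
the path below is a placeholder:

val t0 = System.nanoTime()
val data = sc.textFile("hdfs:///path/to/input") // placeholder path
data.cache()
val n = data.count() // forces the read from HDFS
val elapsedSec = (System.nanoTime() - t0) / 1e9
println(s"Loaded $n lines in $elapsedSec s")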
Hello All,
I am interested in using the bisecting k-means algorithm implemented in Spark.
While using bisecting k-means I found that some of my clustering requests
on large data-sets failed with OOM issues.
As the data-set size is expected to be large, I wanted to use some
pre-processing steps to reduce the data before clustering.
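One possible pre-processing step is dimensionality reduction before
clustering; a rough sketch using PCA with the spark.ml API (column names
and the k values are placeholders to tune for the data):

import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.feature.PCA

// df is assumed to have a Vector column named "features".
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(50) // number of principal components to keep

val reduced = pca.fit(df).transform(df)

val bkm = new BisectingKMeans()
  .setK(10) // number of clusters
  .setFeaturesCol("pcaFeatures")

val model = bkm.fit(reduced)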