Hi,
We use monotonically_increasing_id() as well, but just cache the table first
like Ankur suggested. With that method, we get the same keys in all derived
tables.
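A rough sketch of that approach (Spark 2.x, Scala; the DataFrame and column names here are just placeholders):

import org.apache.spark.sql.functions.monotonically_increasing_id

// tableA stands in for whatever DataFrame the keys are generated on
val keyed = tableA
  .withColumn("row_id", monotonically_increasing_id())
  .cache()
keyed.count()  // materialize once so the generated ids stay fixed

// both derived tables now carry the same row_id values
val derivedA = keyed.select("row_id", "name")
val derivedB = keyed.select("row_id", "score")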
Thanks,
Subhash
Sent from my iPhone
> On Apr 7, 2017, at 7:32 PM, Everett Anderson wrote:
>
> Hi,
Hi,
Thanks, but that's using a random UUID. Certainly unlikely to have
collisions, but not guaranteed.
I'd prefer something like monotonically_increasing_id or RDD's
zipWithUniqueId but with better behavioral characteristics -- so they don't
surprise people when 2+ outputs derived from an
You can use zipWithIndex, the approach Tim suggested, or even the one you
are using, but I believe the issue is that tableA is being materialized
every time you run the new transformations. Are you caching/persisting
table A? If you do that you should not see this behavior.
Thanks
Ankur
On
http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
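For reference, the approach in that answer boils down to a UUID-generating UDF, roughly (Spark 2.x, Scala; df is a placeholder DataFrame):

import org.apache.spark.sql.functions.udf

val uuid = udf(() => java.util.UUID.randomUUID().toString)
val withId = df.withColumn("uid", uuid())

As noted in the thread, UUIDs make collisions extremely unlikely rather than impossible.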
On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson
wrote:
> Hi,
>
> What's the best way to assign a truly unique row ID (rather than a hash)
> to a
Hi,
What's the best way to assign a truly unique row ID (rather than a hash) to
a DataFrame/Dataset?
I originally thought that functions.monotonically_increasing_id would do
this, but it seems to have a rather unfortunate property that if you add it
as a column to table A and then derive tables
Hi There,
Using spark-mllib_2.11-2.1.0. I'm facing an issue where
BucketedRandomProjectionLSHModel.approxNearestNeighbors always returns only
one result.
Dataset looks like:
| id| ...   (rest of the sample output was truncated)
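A minimal example of how that call is usually made (Spark 2.1, Scala); the feature column, key dimensions and LSH parameters below are made up, but asking for k neighbours explicitly and checking the hashing parameters is worth a try, since very few hash tables or a small bucket length can leave too few candidates:

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val brp = new BucketedRandomProjectionLSH()
  .setInputCol("features")      // assumed vector column name
  .setOutputCol("hashes")
  .setNumHashTables(3)
  .setBucketLength(2.0)

val model = brp.fit(dataset)    // dataset is a placeholder DataFrame with a "features" column
val key = Vectors.dense(1.0, 0.0)                               // placeholder query vector
val neighbours = model.approxNearestNeighbors(dataset, key, 5)  // explicitly ask for 5, not 1
neighbours.show()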
Is anyone using structured streaming and writing the results to a Cassandra
database in a production environment?
I do not think I have enough expertise to write a custom sink that can be
used in a production environment. Please help!
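Not a production recipe, but for what it's worth, a bare-bones sketch of a custom sink via ForeachWriter (Spark 2.1) writing rows to Cassandra with the DataStax Java driver; the contact point, keyspace, table and row layout are all assumptions:

import org.apache.spark.sql.{ForeachWriter, Row}
import com.datastax.driver.core.{Cluster, Session}

class CassandraSink extends ForeachWriter[Row] {
  @transient private var session: Session = _

  override def open(partitionId: Long, version: Long): Boolean = {
    session = Cluster.builder().addContactPoint("127.0.0.1").build().connect()
    true
  }

  override def process(row: Row): Unit = {
    // kept short here; a real sink would use prepared statements and handle retries
    session.execute(
      s"INSERT INTO my_ks.events (id, value) VALUES ('${row.getString(0)}', ${row.getLong(1)})")
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.getCluster.close()
  }
}

// usage: streamingDF.writeStream.foreach(new CassandraSink).start()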
Hi Kant,
If you are interested in using Spark alongside a database to serve real
time queries, there are many options. Almost every popular database has
built some sort of connector to Spark. I've listed a majority of them and
tried to delineate them in some way in this StackOverflow answer:
Definitely agree with Gourav there. I wouldn't want Jenkins to run my
workflow. Seems to me that you would only be using Jenkins for its scheduling
capabilities.
Yes, you can run tests, but you wouldn't want it to run your orchestration of
jobs.
What happens if Jenkins goes down for any particular
I'd like to eventually contribute to Spark, but I'm noticing that since Spark
2 the query planner is heavily used throughout the Dataset code base. Are
there any sites I can go to that explain the technical details, more than
just from a high-level perspective
Hi Stephen,
If you use aggregate functions or reduceGroups on KeyValueGroupedDataset, it
behaves like reduceByKey on RDDs.
Only if you use flatMapGroups or mapGroups does it behave like groupByKey on
RDDs, and the API documentation warns against using that API.
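Roughly, for a hypothetical Dataset[(String, Int)] called pairs (with spark.implicits._ in scope), the two styles look like this:

// reduceGroups behaves like reduceByKey on RDDs
val summed = pairs.groupByKey(_._1).reduceGroups((a, b) => (a._1, a._2 + b._2))

// mapGroups pulls all values for a key into one iterator, like groupByKey on
// RDDs, which is the usage the API documentation warns about
val alsoSummed = pairs.groupByKey(_._1).mapGroups { (k, it) => (k, it.map(_._2).sum) }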
Hope this helps.
Thanks
Ankur
On
Hi Steve,
Why would you ever do that? You are suggesting the use of a CI tool as a
workflow and orchestration engine.
Regards,
Gourav Sengupta
On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran
wrote:
> If you have Jenkins set up for some CI workflow, that can do scheduled
Maybe using Ranger or Sentry would be the better choice to intercept those
calls?
> On 7. Apr 2017, at 16:32, Alvaro Brandon wrote:
>
> I was going through SparkContext.textFile() and I was wondering at which
> point Spark communicates with HDFS. Since when
On 7 Apr 2017, at 15:32, Alvaro Brandon
> wrote:
I was going through SparkContext.textFile() and I was wondering at which
point Spark communicates with HDFS. Since when you download the Spark binaries
you also specify the Hadoop
If you have Jenkins set up for some CI workflow, that can do scheduled builds
and tests. Works well if you can do some build test before even submitting it
to a remote cluster
On 7 Apr 2017, at 10:15, Sam Elamin
> wrote:
Hi Shyla
You
As part of my processing, I have the following code:
rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10)
rdd.count()
The S3 directory has about 8 GB of data and 61,878 files. I am using Spark
2.1, and running it on 15 m3.xlarge nodes on EMR.
The job fails with this error:
I was going through SparkContext.textFile() and I was wondering at which
point Spark communicates with HDFS. Since when you download the Spark
binaries you also specify the Hadoop version you will use, I'm guessing it
has its own client that calls HDFS wherever you specify it in the
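For what it's worth, nothing is contacted when the RDD is merely defined; the Hadoop client bundled with Spark only talks to HDFS once an action forces the read, and the URI scheme (plus fs.defaultFS from the Hadoop configuration on the classpath) decides which filesystem is used. Paths below are placeholders:

val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/input.txt")
val fromLocal = sc.textFile("file:///tmp/input.txt")
fromHdfs.count()  // the namenode/datanodes are only contacted here, at action time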
Are there plans to add reduceByKey to DataFrames? Since switching over to
Spark 2 I find myself increasingly dissatisfied with the idea of converting
DataFrames to RDDs to do procedural programming on grouped data (from both an
ease-of-programming and a performance standpoint). So I've been using
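(For the common aggregation cases, DataFrame groupBy with agg keeps reduceByKey-style partial aggregation without a round trip through RDDs; a tiny sketch with made-up names:)

import org.apache.spark.sql.functions.sum

val totals = df.groupBy("key").agg(sum("value").as("total"))  // df is a placeholder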
Hi Devs,
I have some case classes here, and their fields are all optional:
case class A(b: Option[B] = None, c: Option[C] = None, ...)
If I read some data into a Dataset and try to convert it to this case class
using the as method, it doesn't give me any answer; it simply freezes.
If I change the case
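For reference, the shape being described is roughly this (Spark 2.x; the read path and nested types are placeholders, and this sketch does not reproduce the hang):

case class B(x: Option[Int] = None)
case class A(b: Option[B] = None, c: Option[String] = None)

import spark.implicits._   // spark is the SparkSession
val ds = spark.read.parquet("/path/to/data").as[A]
ds.show()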
How do you read them?
> On 7. Apr 2017, at 12:11, Jacek Laskowski wrote:
>
> Hi,
>
> If your Spark app uses snappy in the code, define an appropriate library
> dependency to have it on classpath. Don't rely on transitive dependencies.
>
> Jacek
>
> On 7 Apr 2017 8:34
It's true that CrossValidator is not parallel currently - see
https://issues.apache.org/jira/browse/SPARK-19357 and feel free to help
review.
On Fri, 7 Apr 2017 at 14:18 Aseem Bansal wrote:
>
>- Limited the data to 100,000 records.
>- 6 categorical feature which go
- Limited the data to 100,000 records.
- 6 categorical features which go through imputation, string indexing, and
one-hot encoding. The maximum number of classes for a feature is 100. As the
data is imputed it becomes dense.
- 1 numerical feature.
- Training Logistic Regression through
What is the size of the training data (number of examples, number of
features)? Dense or sparse features? How many classes?
What commands are you using to submit your job via spark-submit?
On Fri, 7 Apr 2017 at 13:12 Aseem Bansal wrote:
> When using spark ml's LogisticRegression,
Hi Yash,
Thank you for the response.
Sorry, it was not at a distinct but at a join stage. It was a self join.
There were no errors, and the jobs were stuck at that step for around 7
hours; the last message that came through was:
*ShuffleBlockFetcherIterator: Started 4 remote fetches*
Thanks,
When using Spark ML's LogisticRegression, RandomForest, CrossValidator, etc.,
do we need to give any consideration while coding to make it scale with more
CPUs, or does it scale automatically?
I am reading some data from S3 and using a pipeline to train a model. I am
running the job on a Spark
Hi All,
Is checkpointing in Spark Streaming synchronous or asynchronous? In other
words, can Spark continue processing the stream while checkpointing?
Thanks!
Hi,
What's the alternative? Dataset? You've got textFile then.
It's an older API from the ages when Dataset was merely experimental.
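For example, in Spark 2.x (paths are placeholders):

import org.apache.spark.sql.Dataset

val lines: Dataset[String] = spark.read.textFile("/path/to/logs")  // typed Dataset[String]
val linesDF = spark.read.text("/path/to/logs")                     // DataFrame with a single "value" column
val rdd = spark.sparkContext.textFile("/path/to/logs")             // the older RDD API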
Jacek
On 29 Mar 2017 8:58 p.m., "George Obama" wrote:
> Hi,
>
> I saw that the API, either R or Scala, we are returning DataFrame for
>
Hi,
If your Spark app uses snappy in the code, define an appropriate library
dependency to have it on classpath. Don't rely on transitive dependencies.
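For example, in sbt that could be an explicit dependency such as the following; the version is only an example, pick one matching your Spark build:

// build.sbt
libraryDependencies += "org.xerial.snappy" % "snappy-java" % "1.1.2.6"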
Jacek
On 7 Apr 2017 8:34 a.m., "satishl" wrote:
Hi, I am planning to process spark app eventlogs with another spark
Try the following
spark-shell --master yarn-client --name nayan /opt/packages/-data-
prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
On Thu, Apr 6, 2017 at 6:36 PM, nayan sharma
wrote:
> Hi All,
> I am getting error while loading CSV file.
>
>
oops sorry. Please ignore this. wrong mailing list
Hi All,
I read the docs, however I still have the following question: for stateful
stream processing, is HDFS mandatory? In some places I see it is required and
in other places I see that RocksDB can be used. I just want to know if HDFS
is mandatory for stateful stream processing?
Thanks!
Hi Shyla
You have multiple options really, some of which have already been listed, but
let me try and clarify.
Assuming you have a Spark application in a jar, you have a variety of options.
You have to have an existing Spark cluster that is either running on EMR or
somewhere else.
*Super simple /
Sorry buddy, didn't get your question quite right.
Just to test, I created a Scala class with Spark CSV and it seemed to work.
Don't know if that would help much, but here are the env details:
EMR 2.7.3
scalaVersion := "2.11.8"
Spark version 2.0.2
On Fri, 7 Apr 2017 at 17:51 nayan sharma
Hi Yash,
I know this will work perfectly, but here I wanted to read the CSV using the
assembly jar file.
Thanks,
Nayan
> On 07-Apr-2017, at 10:02 AM, Yash Sharma wrote:
>
> Hi Nayan,
> I use the --packages with the spark shell and the spark submit. Could you
> please try
Hi, I am planning to process Spark app event logs with another Spark app.
These event logs are saved with snappy compression (extension: .snappy).
When I read the file in a new Spark app I get a snappy library-not-found
error. I am confused as to how Spark can write the event log in snappy format
Yes, lineage that is actually replayable is what is needed for the validation
process, so we can address questions like how a system arrived at a state S
at a time T. I guess a good analogy is event sourcing.
On Thu, Apr 6, 2017 at 10:30 PM, Jörn Franke wrote:
> I do think