That sounds great. Thanks.
Can I assume that the source for a stream in Spark can only be some external
source like Kafka? Can the source not be an RDD in Spark or an external file?
Thanks,
Udbhav
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Friday, September 16, 2016 3:01 AM
To: Udbhav
On 15 Sep 2016, at 19:37, Chen, Kevin wrote:
Hi,
Has anyone encountered an issue of a missing output partition file in S3? My
Spark job writes output to an S3 location. Occasionally, I noticed one partition
file is missing. As a result,
On 16 Sep 2016, at 01:03, Peyman Mohajerian wrote:
You can listen to files in a specific directory using:
Take a look at:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
streamingContext.fileStream
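For illustration, a minimal sketch of a file-based stream (the directory path
and batch interval are placeholders; textFileStream is the text-file
convenience wrapper around fileStream):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName("FileStreamExample")
val ssc = new StreamingContext(conf, Seconds(30))
// Any new files that appear in the monitored directory become a batch.
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.count().print()
ssc.start()
ssc.awaitTermination()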
yes, this works
here's
> On 16 Sep 2016, at 04:43, gsvigruha wrote:
>
> Hi,
>
> is there a way to impersonate multiple users using the same SparkContext
> (e.g. like this
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Superusers.html)
> when going
are you using speculation?
On 15 September 2016 at 21:37, Chen, Kevin wrote:
> Hi,
>
> Has anyone encountered an issue of a missing output partition file in S3?
> My Spark job writes output to an S3 location. Occasionally, I noticed one
> partition file is missing. As a
In fact you can use an RDD as well, using queueStream, but it is intended for
testing, as per the documentation.
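A minimal queueStream sketch for illustration (the RDD contents and batch
interval are placeholders; as noted above this is meant for testing):
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName("QueueStreamExample")
val ssc = new StreamingContext(conf, Seconds(1))
// Each RDD pushed into the queue is consumed as one batch of the stream.
val rddQueue = new mutable.Queue[RDD[Int]]()
ssc.queueStream(rddQueue).count().print()
ssc.start()
for (_ <- 1 to 5) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
  }
  Thread.sleep(1000)
}
ssc.stop()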
On 16 Sep 2016 17:44, "ayan guha" wrote:
> RDD, no. File, yes, using fileStream. But fileStream does not support
> replay, I think. You need to manage checkpointing yourself.
>
You may have to decrease the checkpoint interval to, say, 5 if you're
getting a StackOverflowError. You may have a particularly deep lineage
being created during iterations.
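A hypothetical sketch of breaking a deep lineage by checkpointing every few
iterations (the directory, dataset, and the interval of 5 are placeholders;
assuming sc is the SparkContext from spark-shell):
// Checkpointing every few iterations truncates the lineage so it cannot
// grow deep enough to cause a StackOverflowError.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")
var rdd = sc.parallelize(1 to 1000000)
for (i <- 1 to 100) {
  rdd = rdd.map(_ + 1)
  if (i % 5 == 0) {
    rdd.checkpoint()   // mark for checkpointing (cuts the lineage)
    rdd.count()        // force materialization so the checkpoint is written
  }
}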
No space left on device means you don't have enough local disk to
accommodate the big shuffles in some stage. You can add more
Hi Sean,
At the moment I am using Zeppelin with Spark SQL to get data from Hive. So
any connection here for visualisation has to be through this sort of API.
I know Tableau only uses SQL. Zeppelin can use Spark SQL directly or
through the Spark Thrift Server.
The question is a user may want to
Hi Michael,
Well, for nested structs, I saw in the tests the behaviour defined by
SPARK-12512 for the "a.b.c" handling in withColumn, and even if it's not ideal
for me, I managed to make it work anyway like this:
> df.withColumn("a", struct(struct(myUDF(df("a.b.c." // I didn't put back the
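For what it's worth, a hedged sketch of that workaround (df, the a.b.c schema,
and myUDF are assumed placeholders, not the exact code from this thread): since
withColumn can't target a nested field directly, the outer struct is rebuilt
with the UDF applied to the nested column.
import org.apache.spark.sql.functions.{col, struct, udf}
val myUDF = udf((s: String) => s.toUpperCase)   // placeholder UDF
// Rebuild "a" (and its inner struct "b") so that "a.b.c" is the UDF output.
// Note: any other fields inside a or b would need to be listed here too.
val result = df.withColumn("a",
  struct(struct(myUDF(col("a.b.c")).as("c")).as("b")))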
RDD, no. File, yes, using fileStream. But fileStream does not support replay,
I think. You need to manage checkpointing yourself.
On 16 Sep 2016 16:56, "Udbhav Agarwal" wrote:
> That sounds great. Thanks.
>
> Can I assume that the source for a stream in Spark can only be some
Why Hive, and why precompute data at 15-minute latency? There are
several ways here to query the source data directly with no extra step
or latency. Even Spark SQL is real-time-ish for queries on the
source data, and Impala (or heck, Drill etc.) are.
On Thu, Sep 15, 2016 at 10:56 PM, Mich
countApprox gives the best answer within some timeout. Is it possible
that 1 ms is more than enough to count this exactly? Then the
confidence wouldn't matter. Although that seems way too fast, you're
counting ranges whose values don't actually matter, and maybe the
Python side is smart enough to
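A small sketch of the API being discussed, assuming a plain RDD in spark-shell
(the 1 ms timeout mirrors the question above):
val rdd = sc.parallelize(1 to 1000000, 100)
// Returns whatever has been computed within 1 ms at 95% confidence; if the
// exact count finishes inside the timeout, the result is already exact.
val approx = rdd.countApprox(1L, 0.95)
println(approx.initialValue)      // a BoundedDouble with low/high bounds
println(approx.getFinalValue())   // blocks until the exact count is known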
Hi,
try this conf
val sc = new SparkContext(conf)
sc.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
Regards,
Sai Ganesh
On Thu, Sep 15, 2016 at 11:34 PM, gaurav24 [via Apache Spark User List] <
ml-node+s1001560n27738...@n3.nabble.com> wrote:
> Hi Rok,
>
> facing
Is your Hive Thrift Server up and running on port 10001
(jdbc:hive2://...:10001)?
Do the following
netstat -alnp |grep 10001
and see whether it is actually running
HTH
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Hi
I am trying to run Spark 2.0.0 in the Microsoft Windows environment without
Hadoop or Hive. I am running it in local mode, i.e. cmd> spark-shell, and can
run the shell. When I try to run the sample example
Doc says:
Take a 2-variable feature vector as an example: (x, y), if we want to
expand it with degree 2, then we get (x, x * x, y, x * y, y * y).
I know the polynomial expansion of (x+y)^2 = x^2 + 2xy + y^2 but can't relate
it to the above.
Thanks
Hi,
I am trying to connect to Hive from a Spark application in a Kerberized cluster
and get the following exception. The Spark version is 1.4.1 and Hive is 1.2.1.
Outside of Spark the connection goes through fine.
Am I missing any configuration parameters?
java.sql.SQLException: Could not open
The result includes, essentially, all the terms in (x+y) and (x+y)^2,
and so on up if you chose a higher power. It is not just the
second-degree terms.
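A quick sketch with the ML PolynomialExpansion transformer makes the ordering
concrete (assuming a SparkSession named spark; the values x = 2, y = 3 are just
for illustration):
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(2.0, 3.0))   // (x, y) = (2, 3)
)).toDF("features")
val expander = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("expanded")
  .setDegree(2)
expander.transform(df).select("expanded").show(false)
// Expected: [2.0, 4.0, 3.0, 6.0, 9.0], i.e. (x, x*x, y, x*y, y*y);
// the coefficients (like the 2 in 2xy) are not part of the expansion.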
On Fri, Sep 16, 2016 at 7:43 PM, Nirav Patel wrote:
> Doc says:
>
> Take a 2-variable feature vector as an example: (x,
Hi Advait,
It's due to https://issues.apache.org/jira/browse/SPARK-15565.
See http://stackoverflow.com/a/38945867/1305344 for a solution (that's
spark.sql.warehouse.dir away). Upvote if it works for you. Thanks!
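For reference, a minimal sketch of that workaround (the warehouse path is a
placeholder):
import org.apache.spark.sql.SparkSession
// Point spark.sql.warehouse.dir at a valid URI so Spark 2.0 on Windows
// does not trip over the default relative file: path (SPARK-15565).
val spark = SparkSession.builder()
  .appName("WindowsLocalMode")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()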
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering
Hi! Can you separate the init code from the current command? I think that is
the main problem in your code.
On 16 Sep 2016 at 8:26 PM, "Benjamin Kim" wrote:
> Has anyone using Spark 1.6.2 encountered very slow responses from pulling
> data from PostgreSQL using JDBC? I can get to
Hi,
Have you seen the previous thread?
https://www.mail-archive.com/user@spark.apache.org/msg56791.html
// maropu
On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li wrote:
> Hi,
>
>
> I ran some jobs with Spark 2.0 on Yarn, I found all tasks finished very
> quickly, but the last
Hi Anupama,
To me it looks like an issue with the SPN with which you are trying to connect
to hive2, i.e. hive@hostname.
Are you able to connect to Hive from spark-shell?
Try getting the ticket using any other user's keytab, not the hadoop services
keytab, and then try running the spark-submit.
Thanks
Thanks, that is what I was missing. I have added a cache before the action,
and that second processing is avoided.
2016-09-10 5:10 GMT-07:00 Cody Koeninger :
> Hard to say without seeing the code, but if you do multiple actions on an
> RDD without caching, the RDD will be computed
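A tiny sketch of the pattern (the data and transformation are illustrative;
assuming sc from spark-shell):
val processed = sc.parallelize(1 to 1000000).map(_ * 2).cache()
val total  = processed.count()   // first action computes and caches the RDD
val sample = processed.take(10)  // second action reuses the cached partitions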
Hi,
It'd be better to set `predicates` in jdbc arguments for loading in
parallel.
See:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L200
// maropu
On Sat, Sep 17, 2016 at 7:46 AM, Benjamin Kim wrote:
> I
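A hedged sketch of what that looks like (the URL, table, and column are
placeholders; each predicate becomes one partition, so the table is read over
three parallel connections):
val predicates = Array(
  "created_at <  '2016-01-01'",
  "created_at >= '2016-01-01' AND created_at < '2016-07-01'",
  "created_at >= '2016-07-01'"
)
val props = new java.util.Properties()
props.setProperty("user", "username")
props.setProperty("password", "password")
props.setProperty("driver", "org.postgresql.Driver")
// One task per predicate reads its slice of the table in parallel.
val df = sqlContext.read.jdbc(
  "jdbc:postgresql://host:5432/db", "schema.table", predicates, props)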
I am testing this in spark-shell. I am following the Spark documentation by
simply adding the PostgreSQL driver to the Spark Classpath.
SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell
Then, I run the code below to connect to the PostgreSQL database to query. This
is when I have
Do Ignite and Alluxio offer reasonable means of transferring data, in memory,
from Spark to MPI? A straightforward way to transfer data is to use piping, but
unless you have MPI processes running in a one-to-one mapping to the Spark
partitions, this will require some complicated logic to get working
Hi,
I ran some jobs with Spark 2.0 on YARN and found all tasks finished very
quickly, but in the last step Spark spends a lot of time renaming or moving
data from the S3 temporary directory to the real directory, so then I tried to set
Hello.
Found that there is also a Spark metrics sink, MetricsServlet,
which is enabled by default:
https://apache.googlesource.com/spark/+/refs/heads/master/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#40
Tried urls:
On master:
http://localhost:8080/metrics/master/json/
Sorry, it wasn't the count, it was the reduce method that retrieves
information from the RDD.
It has to go through all the RDD values to return the result.
2016-09-16 11:18 GMT-03:00 chen yong :
> Dear Dirceu,
>
>
> I am totally confused . In your reply you mentioned ".the
Dear Dirceu,
I am totally confused. In your reply you mentioned "...the count does that,
...". However, in the code snippet shown in the attachment file
FelixProblem.png of your previous mail, I cannot find where any 'count' ACTION
is called. Would you please clearly show me the line where it is
Also, I wonder what the right way to debug a Spark program is. If I use ten
anonymous functions in one Spark program, then to debug each of them I have to
place a COUNT action in advance and remove it after debugging. Is that the
right way?
From: Dirceu
Hi All,
I am running a Spark program on a secured cluster which creates a SQLContext
for creating a DataFrame over a Phoenix table.
When I run my program in local mode with the --master option set to local[2],
my program works completely fine; however, when I try to run the same program
with the master option set
Sent from my iPhone
> On Sep 15, 2016, at 1:37 PM, Chen, Kevin wrote:
>
> Hi,
>
> Has anyone encountered an issue of a missing output partition file in S3? My
> Spark job writes output to an S3 location. Occasionally, I noticed one
> partition file is missing. As a
Hi,
I wanted to understand: is there any other advantage besides API syntax when
using the Hive/table API vs. the Dataset API in Spark SQL (v2.0)?
Any additional optimizations, maybe?
I'm most interested in Parquet partitioned tables stored on S3. Is there any
difference if I'm comfortable with the Dataset
Hello Felix,
No, this line isn't the one that is triggering the execution of the
function; the count does that, unless your count val is a lazy val.
The count method is the one that retrieves the information of the RDD; it
has to go through all of its data to determine how many records the RDD
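A tiny illustration of that laziness (assuming sc from spark-shell):
val rdd = sc.parallelize(1 to 10)
val mapped = rdd.map { x =>
  println(s"processing $x")   // not printed yet: map is only recorded
  x * 2
}
val n = mapped.count()        // the action triggers the actual computation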
Oh, this is the Netflix dataset, right? I recognize it from the number
of users/items. It's not fast on a laptop or anything, and takes
plenty of memory, but it succeeds. I haven't run this recently, but it
worked in Spark 1.x.
On Fri, Sep 16, 2016 at 5:13 PM, Roshani Nagmote
I am also surprised that I face these problems with a fairly small dataset on
14 M4.2xlarge machines. Could you please let me know on which dataset you
can run 100 iterations of rank 30 on your laptop?
I am currently just trying to run the default example code given with Spark
to run ALS on the movie
Any solutions for this?
Spark version: 1.4.1, running in standalone mode.
All my applications complete successfully, but the Spark master UI shows that
the executors are in KILLED status.
Is it just a UI bug, or are my executors actually KILLED?
Hello,
Thanks for your reply.
Yes, it's the Netflix dataset. And when I get 'no space left on device', my
'/mnt' directory gets filled up. I checked.
/usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --class
org.apache.spark.examples.mllib.MovieLensALS --jars
Has anyone using Spark 1.6.2 encountered very slow responses from pulling data
from PostgreSQL using JDBC? I can get to the table and see the schema, but when
I do a show, it takes very long or keeps timing out.
The code is simple.
val jdbcDF = sqlContext.read.format("jdbc").options(
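For context, a hedged sketch of what such a JDBC read typically looks like
(the connection details are placeholders, not the actual code truncated above):
val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:postgresql://host:5432/db",
  "dbtable"  -> "schema.table",
  "user"     -> "username",
  "password" -> "password",
  "driver"   -> "org.postgresql.Driver"
)).load()
jdbcDF.show()   // the step reported to be very slow or to time out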