Hi,
I'm trying to migrate some Hive scripts to Spark SQL. However, I found that some statements are incompatible with Spark SQL.
Here is my SQL. And the same SQL works fine in HIVE environment.
SELECT
    if(ad_user_id > 1000, 1000, ad_user_id) as user_id
FROM
Have you solved this problem?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Linkage-error-duplicate-class-definition-tp9482p21260.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
You can persist the RDD in (2) right after it is created. It will not
cause it to be persisted immediately, but rather the first time it is
materialized. If you persist after (3) is calculated, then it will be
re-calculated (and persisted) after (4) is calculated.
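As a minimal sketch of that ordering (the path and transformations here are illustrative, not from the original job):

  val raw = sc.textFile("hdfs:///path/to/input")      // (1)
  val lengths = raw.map(_.length)                     // (2) the RDD to reuse
  lengths.persist()                                   // marked for caching; nothing happens yet
  val total = lengths.sum()                           // (3) first materialization, partitions cached here
  val longLines = lengths.filter(_ > 100).count()     // (4) served from the cache, not recomputed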
On Tue, Jan 20, 2015 at 3:38 AM,
Hi experts!
I am using Spark Streaming with Kinesis and getting this exception while running the program:
java.lang.LinkageError: loader (instance of
org/apache/spark/executor/ChildExecutorURLClassLoader$userClassLoader$):
attempted duplicate class definition for name:
Hi,
Yes, this is what I'm doing. I'm using hiveContext.hql() to run my
query.
But, the problem still happens.
On Tue, Jan 20, 2015 at 7:24 PM, DEVAN M.S. msdeva...@gmail.com wrote:
Add one more library
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"
val
Can you share your code ?
Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
VIDYAPEETHAM | Amritapuri | Cell +919946535290 |
On Tue, Jan 20, 2015 at 5:03 PM, Xuelin Cao xuelincao2...@gmail.com wrote:
Hi,
Yes, this is what I'm doing. I'm using hiveContext.hql() to run
Hi, I'm using Spark 1.2
On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
Hi Xuelin,
What version of Spark are you using?
Thanks,
Daoyuan
*From:* Xuelin Cao [mailto:xuelincao2...@gmail.com]
*Sent:* Tuesday, January 20, 2015 5:22 PM
*To:* User
Thanks Sean !
On Tue, Jan 20, 2015 at 3:32 PM, Sean Owen so...@cloudera.com wrote:
You can persist the RDD in (2) right after it is created. It will not
cause it to be persisted immediately, but rather the first time it is
materialized. If you persist after (3) is calculated, then it will be
In Spark SQL we have Row objects which contain a list of fields that make
up a row. A Row has ordinal accessors such as .getInt(0) or .getString(2).
Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember which
ordinal is what, making the code confusing.
Say for example I have the
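A minimal sketch of the two access styles being contrasted (the ID/Name schema and the case class are illustrative):

  import org.apache.spark.sql.Row

  case class Person(id: Int, name: String)

  // Ordinal access works, but the reader must remember what 0 and 1 mean.
  def fromRow(row: Row): Person = Person(row.getInt(0), row.getString(1))

  // Mapping rows into a case class once gives named fields everywhere else:
  // rows.map(fromRow).filter(_.name.nonEmpty)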
I found the following to be a good discussion of the same topic:
http://apache-spark-user-list.1001560.n3.nabble.com/The-concurrent-model-of-spark-job-stage-task-td13083.html
From: so...@cloudera.com
Date: Tue, 20 Jan 2015 10:02:20 +
Subject: Re: Does Spark automatically run different
Hi, I try to run Spark on a YARN cluster by setting master to yarn-client in Java
code. It works fine with a count task but does not work with other commands.
It threw ClassCastException:
java.lang.ClassCastException: cannot assign instance of
java.lang.invoke.SerializedLambda to field
Hi,
For client mode spark submits of applications, we can do the following:
def createStreamingContext() = {
  ...
  val sc = new SparkContext(conf)
  // Create a StreamingContext with a 1 second batch size
  val ssc = new StreamingContext(sc, Seconds(1))
}
...
val ssc =
Hi Xuelin,
What version of Spark are you using?
Thanks,
Daoyuan
From: Xuelin Cao [mailto:xuelincao2...@gmail.com]
Sent: Tuesday, January 20, 2015 5:22 PM
To: User
Subject: IF statement doesn't work in Spark-SQL?
Hi,
I'm trying to migrate some Hive scripts to Spark SQL. However, I found
I am a beginner to Spark Streaming, so I have a basic question regarding
checkpoints. My use case is to calculate the number of unique users per day. I am
using reduce by key and window for this, where my window duration is 24
hours and the slide duration is 5 minutes. I am updating the processed record to
Which context are you using, HiveContext or SQLContext? Can you try
with HiveContext?
Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
VIDYAPEETHAM | Amritapuri | Cell +919946535290 |
On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com wrote:
Hi, I'm using
Add one more library:
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Replace sqlContext with hiveContext. It's working while using HiveContext
for me.
Devan M.S. | Research Associate | Cyber Security | AMRITA
I run Spark 1.2 and do not have this issue. I don't believe the Hive version
would matter (I run Spark 1.2 with the Hive 12 profile), but that would be a good test.
The last version I tried for you was a CDH4.2 Spark 1.2 prebuilt without
pointing to an external Hive install (in fact I tried it on a
Hi,
It seems you have a very large window (24 hours), so the growth in memory usage
is expected, because the window operation caches the RDDs within the window in
memory. So for your requirement, memory should be large enough to hold the data
of 24 hours.
I don't think checkpoint in Spark
Not sure if anyone who can help has seen this. Any suggestions would be
appreciated, thanks!
On Mon Jan 19 2015 at 1:50:43 PM Dave dla...@gmail.com wrote:
Hi,
I've setup my first spark cluster (1 master, 2 workers) and an iPython
notebook server that I'm trying to setup to access the
The below is not exactly a solution to your question but this is what we
are doing. The first time we do end up calling row.getString(), and we
immediately parse it through a map function which aligns it to either a
case class or a StructType. Then we register it as a table and use just
column
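A rough sketch of that pattern, assuming illustrative table and column names:

  case class Customer(id: Int, name: String)
  import sqlContext.createSchemaRDD

  // Call the ordinal getters exactly once, in a single map...
  val customers = sqlContext.sql("SELECT id, name FROM raw_rows")
    .map(row => Customer(row.getInt(0), row.getString(1)))

  // ...then register the typed result and refer to columns by name from here on.
  customers.registerTempTable("customers")
  val names = sqlContext.sql("SELECT name FROM customers WHERE id > 10")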
Hi Hannes,
Good to know I'm not alone on the boat. Sorry about not posting back, I
haven't gone in a while onto the user list.
It's on my agenda to get over this issue. Will be very important for our
recovery implementation. I have done an internal proof of concept but
without any conclusions
Hey people,
I have run into some issues regarding saving the k-means mllib model in
Spark SQL by converting to a schema RDD. This is what I am doing:
case class Model(id: String, model:
org.apache.spark.mllib.clustering.KMeansModel)
import sqlContext.createSchemaRDD
val rowRdd =
Hi all!
I found this problem when I tried running python application on Amazon's
EMR yarn cluster.
It is possible to run bundled example applications on EMR but I cannot
figure out how to run a little bit more complex python application which
depends on some other python scripts. I tried adding
Hi Folks,
I'm writing a GraphX app and I need to do some test-driven development.
I've got Spark running on our little cluster and have built and run some
hello world apps, so that's all good.
I've looked through the test source and found lots of helpful examples that
use SharedSparkContext, and
Hi,
I am trying to aggregate a key based on some timestamp, and I believe that
spilling to disk is changing the order of the data fed into the combiner.
I have some timeseries data that is of the form: (key, date, other
data)
Partition 1
(A, 2, ...)
(B, 4, ...)
(A, 1, ...)
It’s an array.length, where the array is null. Looking through the code, it
looks like the type converter assumes that FileSystem.globStatus never
returns null, but that is not the case. Digging through the Hadoop
codebase, inside Globber.glob, here’s what I found:
/*
* When the input
All,
I strongly suspect this might be caused by a glitch in the communication
with Google Cloud Storage where my job is writing to, as this NPE exception
shows up fairly randomly. Any ideas?
Exception in thread Thread-126 java.lang.NullPointerException
at
Hey all,
Is there any notion of a lightweight python client for submitting jobs to a
Spark cluster remotely? If I essentially install Spark on the client
machine, and that machine has the same OS, same version of Python, etc.,
then I'm able to communicate with the cluster just fine. But if Python
Hi all!
I found this problem when I tried running python application on Amazon's
EMR yarn cluster.
It is possible to run bundled example applications on EMR but I cannot
figure out how to run a little bit more complex python application which
depends on some other python scripts. I tried adding
Can anybody answer this? Do I have to have hdfs to achieve this?
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
Sent: Friday, January 16, 2015 1:15 PM
To: Imran
Hi all,
I would like to apply a function over all elements for each key (assuming
key-value RDD). For instance, imagine I have:
import numpy as np
a = np.array([[1, 'hola', 'adios'],[2, 'hi', 'bye'],[2, 'hello',
'goodbye']])
a = sc.parallelize(a)
Then I want to create a key-value RDD, using the
Hi Chris,
Short answer is no, not yet.
Longer answer is that PySpark only supports client mode, which means your
driver runs on the same machine as your submission client. By corollary
this means your submission client must currently depend on all of Spark and
its dependencies. There is a patch
Hi Vladimir,
Yes, as the error messages suggests, PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your python files must be on your local file
Related question - is execution of different stages optimized? I.e.
map followed by a filter will require 2 loops or they will be combined
into single one?
On Tue, Jan 20, 2015 at 4:33 AM, Bob Tiernay btier...@hotmail.com wrote:
I found the following to be a good discussion of the same topic:
Hi Justin,
I believe the intended semantics of groupByKey or cogroup is that the
ordering *within a key *is not preserved if you spill. In fact, the test
cases for the ExternalAppendOnlyMap only assert that the Set representation
of the results is as expected (see this line
Thanks Aniket! It is working now.
Anny
On Mon, Jan 19, 2015 at 5:56 PM, Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
When you repartition, ordering can get lost. You would need to sort after
repartitioning.
Aniket
On Tue, Jan 20, 2015, 7:08 AM anny9699 anny9...@gmail.com wrote:
IF is implemented as a generic UDF in Hive (GenericUDFIf). It seems
that this function can’t be properly resolved. Could you provide a
minimum code snippet that reproduces this issue?
Cheng
On 1/20/15 1:22 AM, Xuelin Cao wrote:
Hi,
I'm trying to migrate some hive scripts to
Guess this can be helpful:
http://stackoverflow.com/questions/14252615/stack-function-in-hive-how-to-specify-multiple-aliases
On 1/19/15 8:26 AM, mucks17 wrote:
Hello
I use Hive on Spark and have an issue with assigning several aliases to the
output (several return values) of a UDF. I ran
I think you can resort to a Hive table partitioned by date
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
On 1/11/15 9:51 PM, Paul Wais wrote:
Dear List,
What are common approaches for addressing over a union of tables /
RDDs? E.g.
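A rough sketch of the partitioned-table suggestion above (table, columns, and dates are illustrative): with a date-partitioned table, one query can address the union of many daily datasets by filtering on the partition column.

  hiveContext.hql("""
    CREATE TABLE IF NOT EXISTS events (user_id BIGINT, payload STRING)
    PARTITIONED BY (dt STRING)
  """)
  hiveContext.hql("SELECT COUNT(*) FROM events WHERE dt >= '2015-01-01' AND dt <= '2015-01-07'")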
If the dataset is not huge (in a few GB), you can setup NFS instead of
HDFS (which is much harder to setup):
1. export a directory in master (or anyone in the cluster)
2. mount it in the same position across all slaves
3. read/write from it by file:///path/to/monitpoint
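A minimal sketch of step 3 (paths are illustrative): once the mount is visible at the same path on every node, plain file:// URIs work for both reads and writes.

  val data = sc.textFile("file:///mnt/shared/input.txt")
  data.map(_.toUpperCase).saveAsTextFile("file:///mnt/shared/output")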
On Tue, Jan 20, 2015 at
I don’t think it will work without HDFS.
Mohammed
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
Sent: Tuesday, January 20, 2015 7:55 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: RE: Can I save RDD to local file system and then read it back on spark
Please also see this thread:
http://search-hadoop.com/m/JW1q5De7pU1
On Tue, Jan 20, 2015 at 3:58 PM, Sean Owen so...@cloudera.com wrote:
Guava is shaded in Spark 1.2+. It looks like you are mixing versions
of Spark then, with some that still refer to unshaded Guava. Make sure
you are not
Yeah, as Michael said, I forgot that UDT is not a public API. Xiangrui's
suggestion makes more sense.
Cheng
On 1/20/15 12:49 PM, Xiangrui Meng wrote:
You can save the cluster centers as a SchemaRDD of two columns (id:
Int, center: Array[Double]). When you load it back, you can construct
the
Hey Surbhit,
In this case, the web UI stats is not accurate. Please refer to this
thread for an explanation:
https://www.mail-archive.com/user@spark.apache.org/msg18919.html
Cheng
On 1/13/15 1:46 AM, Surbhit wrote:
Hi,
I am using spark 1.1.0.
I am using the spark-sql shell to run all the
Guava is shaded in Spark 1.2+. It looks like you are mixing versions
of Spark then, with some that still refer to unshaded Guava. Make sure
you are not packaging Spark with your app and that you don't have
other versions lying around.
On Tue, Jan 20, 2015 at 11:55 PM, Shailesh Birari
+1. I can confirm this. It says collect fails in Py4J
--- Original Message ---
From: Dave dla...@gmail.com
Sent: January 20, 2015 6:49 AM
To: user@spark.apache.org
Subject: Re: Error for first run from iPython Notebook
Not sure if anyone who can help has seen this. Any suggestions would be
Hello,
I recently upgraded my setup from Spark 1.1 to Spark 1.2.
My existing applications are working fine on ubuntu cluster.
But, when I try to execute Spark MLlib application from Eclipse (Windows
node) it gives java.lang.NoClassDefFoundError:
com/google/common/base/Preconditions exception.
Hi,
I issued the following command in a ec2 cluster launched using spark-ec2:
~/spark/bin/spark-submit --class com.crowdstar.cluster.etl.ParseAndClean
--master spark://ec2-54-185-107-113.us-west-2.compute.amazonaws.com:7077
--deploy-mode cluster --total-executor-cores 4
A map followed by a filter will not be two stages, but rather one stage that
pipelines the map and filter.
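A tiny illustrative example: both transformations are applied in one pass over each partition, within a single stage.

  val nums = sc.parallelize(1 to 1000000)
  val count = nums.map(_ * 2).filter(_ % 3 == 0).count()   // one stage, no intermediate materialization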
On Jan 20, 2015, at 10:26 AM, Kane Kim kane.ist...@gmail.com wrote:
Related question - is execution of different stages optimized? I.e.
map followed by a filter will require 2 loops
Hi sranga,
Were you ever able to get authentication working with the temporary IAM
credentials (id, secret, token)? I am in the same situation and it would
be great if we could document a solution so others can benefit from this
Thanks!
sranga wrote
Thanks Rishi. That is exactly what I am
Spark's network-common package depends on guava as a provided dependency
in order to avoid conflicting with other libraries (e.g., Hadoop) that
depend on specific versions. com/google/common/base/Preconditions has been
present in Guava since version 2, so this is likely a dependency not
found
Hi,
I am getting task result deserialization error (kryo is enabled). Is it
some sort of `chill` registration issue at front end?
This is application that lists spark as maven dependency (so it gets
correct hadoop and chill dependencies in classpath, i checked).
Thanks in advance.
15/01/20
Thanks Aaron.
Adding Guava jar resolves the issue.
Shailesh
On Wed, Jan 21, 2015 at 3:26 PM, Aaron Davidson ilike...@gmail.com wrote:
Spark's network-common package depends on guava as a provided dependency
in order to avoid conflicting with other libraries (e.g., Hadoop) that
depend on
I am joining two tables as below; the program stalls at the log line below and
never proceeds.
What might be the issue, and what is a possible solution?
INFO SparkContext: Starting job: RangePartitioner at Exchange.scala:79
Table 1 has 450 columns
Table2 has 100 columns
Both tables have few million
Hello everyone,
Is there a Python connector for Spark and Cassandra, as there is one for
Java? I found a Java connector by DataStax on GitHub:
https://github.com/datastax/spark-cassandra-connector
I am looking for something similar in Python.
Thanks
Hi,
I am having similar issues. Have you found any resolution ?
Thank you
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-tp21222p21276.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi Frank,
It's a normal Eclipse project where I added the Scala and Spark libraries as
user libraries.
Though I am not attaching any Hadoop libraries, in my application code I
have the following line:
System.setProperty("hadoop.home.dir", "C:\\SB\\HadoopWin")
This Hadoop home dir contains winutils.exe
As the title says.
Is there some mechanism to recover so that the job can be completed?
Any comments will be very appreciated.
Best Regards,
Anzhsoft
Hi all
How can I add MapType and ArrayType to schema when I create StructType
programmatically?
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
The above code from the Spark documentation works fine, but if I change StringType to
MapType or
Hi,
I just recently tried to migrate from Spark 1.1 to Spark 1.2, using
PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster
than Spark 1.1. However, the initial joy faded quickly when I noticed that
my jobs no longer terminated successfully. Using
Hello,
I double checked the libraries. I am linking only with Spark 1.2.
Along with Spark 1.2 jars I have Scala 2.10 jars and JRE 7 jars linked and
nothing else.
Thanks,
Shailesh
On Wed, Jan 21, 2015 at 12:58 PM, Sean Owen so...@cloudera.com wrote:
Guava is shaded in Spark 1.2+. It looks
Hey Yana,
Sorry for the late reply, missed this important thread somehow. And many
thanks for reporting this. It turned out to be a bug — filter pushdown
is only enabled when using client side metadata, which is not expected,
because task side metadata code path is more performant. And I
Hi,
I don't think current Spark Streaming supports this feature; all the DStream
lineage is fixed after the context is started.
Also, stopping a single stream is not supported; currently you need to stop the
whole streaming context to achieve what you want.
Thanks
Saisai
-Original
Are the gz files roughly equal in size? Do you know that your partitions
are roughly balanced? Perhaps some cores get assigned tasks that end very
quickly, while others get most of the work.
On Sat Jan 17 2015 at 2:02:49 AM Gautham Anil gautham.a...@gmail.com
wrote:
Hi,
Thanks for getting
Shailesh,
To add, are you packaging Hadoop in your app? Hadoop will pull in Guava. Not
sure if you are using Maven (or something else) to build, but if you can pull up your
build's dependency tree, you will likely find com.google.guava being brought in
by one of your dependencies.
Regards,
Frank Austin
Kevin (Sangwoo) Kim wrote
If keys are not too many,
You can do like this:
val data = List(
  ("A", Set(1,2,3)),
  ("A", Set(1,2,4)),
  ("B", Set(1,2,3))
)
val rdd = sc.parallelize(data)
rdd.persist()
rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 ==
Great to hear you got solution!!
Cheers!
Kevin
On Wed Jan 21 2015 at 11:13:44 AM jagaximo takuya_seg...@dwango.co.jp
wrote:
Kevin (Sangwoo) Kim wrote
If keys are not too many,
You can do like this:
val data = List(
  ("A", Set(1,2,3)),
  ("A", Set(1,2,4)),
  ("B", Set(1,2,3))
)
val
You need to provide the key type and value type for a map type, the element type
for an array type, and whether they contain null:
StructType(Array(
  StructField("map_field", MapType(keyType = IntegerType, valueType = StringType, valueContainsNull = true), nullable = true),
  StructField("array_field", ArrayType(elementType = StringType, containsNull = true), nullable = true)))
If using Maven, one can simply use whatever version they prefer and shade the
artifact at build time using something like:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
Hi,
I am trying to move from 1.1 to 1.2 and found that newFilesOnly=false
(intended to include old files) does not work anymore. It works great in 1.1;
this seems to have been introduced by the last change to this class.
Did this flag's behavior change, or is it a regression?
The issue seems to be caused by
On 1/15/15 11:26 PM, Nathan McCarthy wrote:
Thanks Cheng!
Is there any API I can get access to (e.g. ParquetTableScan) which
would allow me to load up the low-level/base RDD of just RDD[Row] so I
could avoid the defensive copy (and maybe lose out on columnar storage etc.).
We have parts of our
Hi Andrew,
Thanks for your response! For our use case, we aren't actually grouping,
but rather updating running aggregates. I just picked grouping because it
made the example easier to write out. However, when we merge combiners, the
combiners have to have data that are adjacent to each other in
Could you provide a short script to reproduce this issue?
On Tue, Jan 20, 2015 at 9:00 PM, TJ Klein tjkl...@gmail.com wrote:
Hi,
I just recently tried to migrate from Spark 1.1 to Spark 1.2 - using
PySpark. Initially, I was super glad, noticing that Spark 1.2 is way faster
than Spark 1.1.
The LogParser instance is not serializable, and thus cannot be a broadcast
variable. What's worse, it contains an LRU cache, which is essential to
performance, and which we would like to share among all the tasks on the same
node.
If that is the case, what's the recommended way to share a variable among all
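One common workaround for sharing such an instance across tasks on a node (sketched here with illustrative names; LogParser stands in for the poster's class) is to hold it in a JVM-wide singleton, so it is built lazily once per executor and shared by all tasks running there:

  object ParserHolder {
    lazy val parser = new LogParser()   // built once per executor JVM; keeps its LRU cache
  }

  val parsed = lines.map(line => ParserHolder.parser.parse(line))  // nothing non-serializable in the closure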
Hi,
It's a bit of a longer script that runs some deep learning training.
Therefore it is a bit hard to wrap up easily.
Essentially I am having a loop, in which a gradient is computed on each
node and collected (this is where it freezes at some point).
grads =
Could you try disabling the new worker-reuse feature by setting:
spark.python.worker.reuse = false
On Tue, Jan 20, 2015 at 11:12 PM, Tassilo Klein tjkl...@bwh.harvard.edu wrote:
Hi,
It's a bit of a longer script that runs some deep learning training.
Therefore it is a bit hard to wrap up easily.
I don't know of any reason to think the singleton pattern doesn't work or
works differently. I wonder if, for example, task scheduling is different
in 1.2 and you have more partitions across more workers and so are loading
more copies more slowly into your singletons.
On Jan 21, 2015 7:13 AM,
Hi all,
Please help me find the best way to do K-nearest neighbors using Spark for
large data sets.
Currently we are migrating from spark 1.1 to spark 1.2, but found the
program 3x slower, with nothing else changed.
Note: our program on Spark 1.1 has successfully processed a whole year of
data, quite stably.
The main script is as below:
sc.textFile(inputPath)
  .flatMap(line =>
Please check which netty jar(s) are on the classpath.
NioWorkerPool(Executor workerExecutor, int workerCount) was added in netty
3.5.4
Cheers
On Tue, Jan 20, 2015 at 4:15 PM, ey-chih chow eyc...@hotmail.com wrote:
Hi,
I issued the following command in a ec2 cluster launched using spark-ec2:
Maybe some change related to closure serialization caused LogParser to no
longer be a singleton, so it is initialized for every task.
Could you change it to a broadcast variable?
On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO raofeng...@gmail.com wrote:
Currently we are migrating from spark 1.1 to spark
Hi,
I am developing a Spark Streaming application where I want every item in my
stream to be assigned a unique, strictly increasing Long. My input data
already has RDD-local integers (from 0 to N-1) assigned, so I am doing the
following:
var totalNumberOfItems = 0L
// update the keys of the
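A rough sketch of how that fragment might continue, assuming each batch RDD is keyed by its RDD-local index (names other than totalNumberOfItems are illustrative):

  val globallyIndexed = localIndexedStream.transform { rdd =>
    val offset = totalNumberOfItems        // this block runs on the driver, once per batch
    totalNumberOfItems += rdd.count()      // advance the running total by this batch's size
    rdd.map { case (localIndex, item) => (offset + localIndex, item) }
  }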
How about using accumulators
http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators?
Thanks
Best Regards
On Wed, Jan 21, 2015 at 12:53 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I am developing a Spark Streaming application where I want every item in
my stream to be
We are currently migrating from 1.1 to 1.2, and found our program 3x slower,
maybe due to the singleton hack?
Could you explain in detail why or how "the singleton hack works very
different in spark 1.2.0"?
Thanks!
2015-01-18 20:56 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch:
The singleton
See also SPARK-3276 and SPARK-3553. Can you say more about the
problem? what are the file timestamps, what happens when you run, what
log messages if any are relevant. I do not expect there was any
intended behavior change.
On Wed, Jan 21, 2015 at 5:17 AM, Terry Hole hujie.ea...@gmail.com wrote:
Hi,
On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
How about using accumulators
http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators?
As far as I understand, they solve the part of the problem that I am not
worried about, namely increasing the
The assumption of implicit feedback model is that the unobserved
ratings are more likely to be negative. So you may want to add some
negatives for evaluation. Otherwise, the input ratings are all 1 and
the test ratings are all 1 as well. The baseline predictor, which uses
the average rating (that
You can get a SchemaRDD from the Hive table, map it into a RDD of
Vectors, and then construct a RowMatrix. The transformations are lazy,
so there is no external storage requirement for intermediate data.
-Xiangrui
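A rough sketch of those steps, with an illustrative table and a three-column layout:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = hiveContext.sql("SELECT f1, f2, f3 FROM features")
    .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1), row.getDouble(2)))

  val mat = new RowMatrix(rows)    // still lazy; nothing is materialized until an action runs
  println(mat.numRows() + " x " + mat.numCols())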
On Sun, Jan 18, 2015 at 4:07 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi,
We
First of all, even if the underlying dataset is partitioned as expected,
a shuffle can't be avoided, because Spark SQL knows nothing about the
underlying data distribution. However, this does reduce network IO.
You can prepare your data like this (say CustomerCode is a string
field with
You can save the cluster centers as a SchemaRDD of two columns (id:
Int, center: Array[Double]). When you load it back, you can construct
the k-means model from its cluster centers. -Xiangrui
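A rough sketch of that approach (paths and the helper case class are illustrative; model is assumed to be an already trained KMeansModel):

  import org.apache.spark.mllib.clustering.KMeansModel
  import org.apache.spark.mllib.linalg.Vectors
  import sqlContext.createSchemaRDD

  case class Center(id: Int, center: Array[Double])

  // Save: one row per cluster center.
  sc.parallelize(model.clusterCenters.zipWithIndex.map { case (v, i) => Center(i, v.toArray) })
    .saveAsParquetFile("centers.parquet")

  // Load: rebuild the model from its centers (array columns come back as Seq).
  val centers = sqlContext.parquetFile("centers.parquet")
    .map(r => (r.getInt(0), r(1).asInstanceOf[Seq[Double]]))
    .collect().sortBy(_._1)
    .map { case (_, c) => Vectors.dense(c.toArray) }
  val restored = new KMeansModel(centers)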
On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian lian.cs@gmail.com wrote:
This is because KMeansModel is
Hey Yi,
I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would
like to investigate this issue later. Would you please open a JIRA for
it? Thanks!
Cheng
On 1/19/15 1:00 AM, Yi Tian wrote:
Is there any way to support multiple users executing SQL on one thrift
server?
I
I had once worked on a named row feature but haven’t got time to finish
it. It looks like this:
sql(...).named.map { row: NamedRow =>
  row[Int]('key) -> row[String]('value)
}
Basically the named method generates a field name to ordinal map for
each RDD partition. This map is then shared
Hi all,
we have been trying to setup a stream using a custom receiver that would
pick up data from sql databases. we'd like to keep that stream context
running and dynamically change the streams on demand, adding and removing
streams based on demand. Alternatively, if a stream is fixed, is it
spark.sql.parquet.filterPushdown defaults to false because there's a
bug in Parquet which may cause NPE; please refer to
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
This bug hasn’t been fixed in Parquet master. We’ll turn this on once
the bug is fixed.
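For reference, a one-liner to opt back in, accepting the NPE risk described above (the same property can also be set via spark-defaults.conf):

  sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")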
In Spark SQL, Parquet filter pushdown doesn't cover HiveTableScan for
now. May I ask why you prefer HiveTableScan rather than
ParquetTableScan?
Cheng
On 1/19/15 5:02 PM, Xiaoyu Wang wrote:
The spark.sql.parquet.filterPushdown=true has been turned on. But
set
Can you not do it with RDDs?
Thanks
Best Regards
On Wed, Jan 21, 2015 at 12:38 AM, jamborta jambo...@gmail.com wrote:
Hi all,
we have been trying to setup a stream using a custom receiver that would
pick up data from sql databases. we'd like to keep that stream context
running and
This is because KMeansModel is neither a built-in type nor a user-defined
type recognized by Spark SQL. I think you can write your own UDT
version of KMeansModel in this case. You may refer to
o.a.s.mllib.linalg.Vector and o.a.s.mllib.linalg.VectorUDT as an
example.
Cheng
On 1/20/15
I have the following code in a Spark Job.
// Get the baseline input file(s)
JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable =
    sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class, Text.class, Text.class);
JavaPairRDD<String, String> hsfBaselinePairRDD =