Figured it out. spark-sql --master spark://sparkmaster:7077
You can open the application UI (which runs on port 4040) and see how much memory
is being allocated to the executors from the Executors tab and the Environment tab.
Thanks
Best Regards
On Wed, Oct 22, 2014 at 9:55 PM, Holden Karau hol...@pigscanfly.ca wrote:
Hi Michael Campbell,
Are you deploying against yarn
Try doing a cat -v your_data | head -n3 and make sure you are not getting
any ^M at the end of the lines. Also, your 2nd and 3rd rows don't contain any
space in the data.
Thanks
Best Regards
On Thu, Oct 23, 2014 at 9:23 AM, tridib tridib.sama...@live.com wrote:
Hello Experts,
I created a table
You can use the --jars option to submit multiple jars using the
spark-submit, so you can simply build the jar that you have modified.
Thanks
Best Regards
On Thu, Oct 23, 2014 at 11:16 AM, Mohit Jaggi mohitja...@gmail.com wrote:
i think you can give a list of jars - not just one - to
Hi all,
I have two network interface cards on one node: one is an Ethernet card,
the other an Infiniband HCA.
The master has two IP addresses, let's say 1.2.3.4 (for the Ethernet card)
and 2.3.4.5 (for the HCA).
I can start the master by
export SPARK_MASTER_IP='1.2.3.4';sbin/start-master.sh
to let master
Make sure the guava jar
http://mvnrepository.com/artifact/com.google.guava/guava/12.0 is present
in the classpath.
Thanks
Best Regards
On Thu, Oct 23, 2014 at 2:13 PM, Stephen Boesch java...@gmail.com wrote:
After having checked out from master/head the following error occurs when
attempting
Try providing the level of parallelism parameter to your reduceByKey
operation.
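As a minimal sketch (assuming a spark-shell session where sc is already defined; the input path and the count of 200 partitions are placeholders, not from this thread):

import org.apache.spark.SparkContext._

// Hypothetical pair RDD: one (word, 1) pair per token.
val pairs = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" ")).map(w => (w, 1))
// The second argument to reduceByKey sets the number of reduce partitions
// explicitly instead of relying on the default parallelism.
val counts = pairs.reduceByKey(_ + _, 200)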
Thanks
Best Regards
On Fri, Oct 24, 2014 at 3:44 AM, xuhongnever xuhongne...@gmail.com wrote:
my code is here:
from pyspark import SparkConf, SparkContext
def Undirect(edge):
vector =
You can use spark-sql to solve this use case, and you don't need to have
800G of memory (but of course, if you are caching the whole data into
memory, then you would need it). You can persist the data by setting the
DISK_AND_MEMORY_SER property if you don't want to bring the whole data into
memory, in this
On Thu, Oct 23, 2014 at 3:14 PM, xuhongnever xuhongne...@gmail.com wrote:
my code is here:
from pyspark import SparkConf, SparkContext
def Undirect(edge):
    vector = edge.strip().split('\t')
    if(vector[0].isdigit()):
        return [(vector[0], vector[1])]
    return []
conf =
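As a side note on the storage-level suggestion above: the constant in Spark's StorageLevel is actually spelled MEMORY_AND_DISK_SER. A small sketch, assuming a spark-shell session (the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

// Keep what fits in memory as serialized bytes and spill the rest to disk,
// instead of requiring the whole dataset to be cached in RAM.
val edges = sc.textFile("hdfs:///tmp/edges.tsv")
edges.persist(StorageLevel.MEMORY_AND_DISK_SER)
edges.count()  // the first action materializes the persisted data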
Try using the --ip parameter
http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually
while starting the worker. like:
spark-1.0.1/bin/spark-class org.apache.spark.deploy.worker.Worker --ip
1.2.3.4 spark://1.2.3.4:7077
Thanks
Best Regards
On Fri, Oct 24, 2014 at
Hi Joseph,
Thanks for the help.
I have tried this DecisionTree example with the latest spark code and it is
working fine now. But how do we choose the maxBins for this model?
Thanks
Lokesh
I want to set up Spark SQL to allow ad hoc querying over the last X days of
processed data, where the data is processed through spark. This would also
have to cache data (in memory only), so the approach I was thinking of was
to build a layer that persists the appropriate RDDs and stores them in
The documentation says that we need to override the hashCode and
equals methods. It doesn't work without overriding them either. I get this error both
in the REPL and in a standalone application
On Fri, Oct 24, 2014 at 3:29 AM, Prashant Sharma scrapco...@gmail.com
wrote:
Are you doing this in REPL ? Then
I imagine this is a side effect of the change that was just reverted,
related to publishing the effective pom? sounds related but I don't
know.
On Fri, Oct 24, 2014 at 2:22 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
Hi folks,
I'm trying to deploy the latest from master branch and having
There's an issue in the way case classes are handled on the REPL and you
won't be able to use a case class as a key. See:
https://issues.apache.org/jira/browse/SPARK-2620
BTW, case classes already implement equals and hashCode; there's no need to
implement them again.
Given that you already
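A short sketch of the workaround in a compiled application (class and field names are invented for illustration); case-class keys behave correctly there because SPARK-2620 only affects case classes defined in the REPL:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// equals and hashCode are generated automatically for case classes.
case class UserKey(id: Long, country: String)

object CaseClassKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("case-class-key"))
    val pairs = sc.parallelize(Seq((UserKey(1L, "US"), 10), (UserKey(1L, "US"), 5)))
    // Groups correctly in a standalone app; the same code in the REPL can
    // mis-group keys because of how the REPL wraps class definitions.
    pairs.reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}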
Hi All,
I am relatively new to Spark and currently having trouble broadcasting
large variables ~500 MB in size. The broadcast fails with the error shown
below, and the memory usage on the hosts also blows up.
Our hardware consists of 8 hosts (1 x 64gb (driver) and 7 x 32gb (workers))
and we
Hi,
I just wonder if there is any built-in function to get the execution time
for each of the jobs/tasks? In simple words, how can I find out how much
time is spent on the loading/mapping/filtering/reducing part of a job? I can
see printouts in the logs, but since there is no clear presentation of the
Hi,
I’m running a job whose simple task is to find files that cannot be read
(sometimes our gz files are corrupted).
With 1.0.x, this worked perfectly. Since 1.1.0 however, I’m getting an
exception:
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
Thanks Akhil.
I searched DISK_AND_MEMORY_SER trying to figure out how it works, and I
cannot find any documentation on that. Do you have a link for that?
If what DISK_AND_MEMORY_SER does is reading and writing to the disk with
some memory caching, does that mean the output will be written to
Hi all
I have written a job that reads data from HBase and writes to HDFS (fairly
simple). While running the job, I noticed that a few of the tasks failed
with the following error. Quick googling on the error suggests that it's an
unexplained error and is perhaps intermittent. What I am curious to
Just curious... Why would you not store the processed results in a regular
relational database? Not sure what you meant by persisting the appropriate
RDDs. Did you mean the output of your job will be RDDs?
On 24 October 2014 13:35, ankits ankitso...@gmail.com wrote:
I want to set up Spark SQL to allow
Thanks -- that was it. I could swear this had worked for me before, and
indeed it's fixed this morning.
On Fri, Oct 24, 2014 at 6:34 AM, Sean Owen so...@cloudera.com wrote:
I imagine this is a side effect of the change that was just reverted,
related to publishing the effective pom? sounds
I used standalone Spark and set spark.driver.memory=5g, but the spark-submit process uses
57g of memory. Is this normal? How can I decrease it?
I found this. So it seems that we should use -h or --host instead of -i
and --ip.
-i HOST, --ip IP Hostname to listen on (deprecated, please use
--host or -h)
-h HOST, --host HOST Hostname to listen on
On 10/24/2014 3:35 PM, Akhil Das wrote:
Try using the --ip parameter
The Spark UI has timing information. When running locally, it is at
http://localhost:4040
Otherwise the url to the UI is printed out onto the console when you
startup spark shell or run a job.
Reza
On Fri, Oct 24, 2014 at 5:51 AM, shahab shahab.mok...@gmail.com wrote:
Hi,
I just wonder if
Hi,
Trying to run a query on spark-sql, but it keeps failing with this error on
the CLI (we are running spark-sql on a YARN cluster):
org.apache.spark.SparkException: Job cancelled because SparkContext was
shut down
at
That does seem a bit odd. How many Executors are running under this Driver?
Does the spark-submit process start out using ~60GB of memory right away, or
does it start out smaller and slowly build up to that level? If so, how long
does it take to get there?
Also, which version of Spark are you
Hi all,
I'm trying to set a pool for a JDBC session. I'm connecting to the thrift
server via JDBC client.
My installation appears to be good (queries run fine), I can see the pools
in the UI, but any attempt to set a variable (I tried
spark.sql.shuffle.partitions and
It does have support for caching using either CACHE TABLE tablename or
CACHE TABLE tablename AS SELECT
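For example, issued from Scala (assuming sqlContext is an existing SQLContext/HiveContext; the table name and query are placeholders):

// Cache an existing table, or cache the result of a query under a new name.
sqlContext.sql("CACHE TABLE logs")
sqlContext.sql("CACHE TABLE recent_logs AS SELECT * FROM logs WHERE date >= '2014-10-01'")
// UNCACHE TABLE releases the cached data when it is no longer needed.
sqlContext.sql("UNCACHE TABLE recent_logs")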
On Fri, Oct 24, 2014 at 1:05 AM, ankits ankitso...@gmail.com wrote:
I want to set up Spark SQL to allow ad hoc querying over the last X days of
processed data, where the data is
You might be hitting: https://issues.apache.org/jira/browse/SPARK-4037
On Fri, Oct 24, 2014 at 11:32 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi all,
I'm trying to set a pool for a JDBC session. I'm connecting to the thrift
server via JDBC client.
My installation appears to be
Maybe I'll add one more question. I think that the problem is with the user, so I
would like to ask: under which user are Spark jobs run on the slaves?
Hi,
I am trying to implement a function for text preprocessing in PySpark. I have Amazon
Hi,
I would like to ask under which user the Spark program is run on the slaves. My
Spark is running on top of YARN.
The reason I am asking is that I need to download data for the NLTK
library; these data are downloaded for a specific Python user, and I am
currently struggling with this.
Is there a way to cache certain (or the most recent) partitions of certain
tables?
On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust mich...@databricks.com
wrote:
It does have support for caching using either CACHE TABLE tablename or
CACHE TABLE tablename AS SELECT
On Fri, Oct 24, 2014 at
It won't be transparent, but you can do something like:
CACHE TABLE newData AS SELECT * FROM allData WHERE date ...
and then query newData.
On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood sadhan.s...@gmail.com wrote:
Is there a way to cache certain (or most latest) partitions of certain
That works perfectly. Thanks again, Michael.
On Fri, Oct 24, 2014 at 3:10 PM, Michael Armbrust mich...@databricks.com
wrote:
It won't be transparent, but you can do something like:
CACHE TABLE newData AS SELECT * FROM allData WHERE date ...
and then query newData.
On Fri, Oct 24, 2014 at
Hello,
My map function will call the following function (inc) which should yield
multiple values:
def inc(x: Int, y: Int) = {
  if (condition) {
    for (i <- 0 to 7) yield (x, y + i)
  } else {
    for (k <- 0 to 24 - y) yield (x, y + k)
    for (j <- 0 to y - 16) yield (x + 1, j)
  }
}
The if part works
This is just a Scala question really. Use ++
def inc(x: Int, y: Int) = {
  if (condition) {
    for (i <- 0 to 7) yield (x, y + i)
  } else {
    (for (k <- 0 to 24 - y) yield (x, y + k)) ++ (for (j <- 0 to y - 16) yield (x + 1, j))
  }
}
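Since inc returns a collection of pairs rather than a single value, the call site would typically use flatMap rather than map. A small usage sketch, assuming inc is defined as above and sc is a live SparkContext (the sample input pairs are invented):

val input = sc.parallelize(Seq((1, 3), (2, 20)))
// Each input pair expands into several output pairs.
val expanded = input.flatMap { case (x, y) => inc(x, y) }
expanded.collect().foreach(println)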
On Fri, Oct 24, 2014 at 8:52 PM, HARIPRIYA AYYALASOMAYAJULA
Hi Lokesh,
Glad the update fixed the bug. maxBins is a parameter you can tune based
on your data. Essentially, larger maxBins is potentially more accurate,
but will run more slowly and use more memory. maxBins must be <= training
set size; I would say try some small values (4, 8, 16). If there
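For reference, a hedged sketch of where maxBins goes in the MLlib DecisionTree API (the data path and the other parameter values are illustrative only; sc is assumed to be a live SparkContext):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Load a LIBSVM-format dataset (path is a placeholder).
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()  // all features treated as continuous
val impurity = "gini"
val maxDepth = 5
val maxBins = 16  // start with small values such as 4, 8, 16
val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)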
Thanks Sean!
On Fri, Oct 24, 2014 at 3:04 PM, Sean Owen so...@cloudera.com wrote:
This is just a Scala question really. Use ++
def inc(x:Int, y:Int) = {
if (condition) {
    for (i <- 0 to 7) yield (x, y + i)
  } else {
    (for (k <- 0 to 24 - y) yield (x, y + k)) ++ (for (j <- 0 to y - 16)
Thank you very much.
Changing to groupByKey works; it runs much faster.
By the way, could you give me some explanation of the following
configurations? After reading the official explanation, I'm still confused.
What's the relationship between them? Is there any memory overlap between
them?
At 2014-10-23 09:48:55 +0530, Arpit Kumar arp8...@gmail.com wrote:
error: value partitionBy is not a member of
org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID,
org.apache.spark.graphx.Edge[ED])]
Since partitionBy is a member of PairRDDFunctions, it sounds like the implicit
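In pre-1.3 Spark those implicits live in the SparkContext companion object, so the likely fix is to have that import in scope. A sketch with an invented pair RDD standing in for the (PartitionID, Edge[ED]) RDD from the error message:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // brings the RDD-to-PairRDDFunctions implicit into scope

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
val partitioned = pairs.partitionBy(new HashPartitioner(8))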
On Fri, Oct 24, 2014 at 1:37 PM, xuhongnever xuhongne...@gmail.com wrote:
Thank you very much.
Changing to groupByKey works; it runs much faster.
By the way, could you give me some explanation of the following
configurations? After reading the official explanation, I'm still confused,
I am trying to convert some json logs to Parquet and save them on S3.
In principle this is just
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
data.registerAsTable("data")
data.saveAsParquetFile("s3n://target/path")
This
Just wondering, any update on this? Is there a plan to integrate CJ's work
with MLlib? I'm asking since the SVM implementation in MLlib did not give us good results
and we had to resort to training our SVM classifier in a serial manner on
the driver node with liblinear.
Also, it looks like CJ Lin is coming
If the SVM is not already migrated to BFGS, that's the first thing you
should try... Basically, following the LBFGS logistic regression, come up with an
LBFGS-based linear SVM...
About integrating TRON in MLlib, David already has a version of TRON in
Breeze, but someone needs to validate it for linear SVM
This is very experimental and mostly unsupported, but you can start the
JDBC server from within your own programs
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45
by passing it the HiveContext.
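A rough sketch of that programmatic route, assuming the startWithContext entry point in the linked HiveThriftServer2 source is available in your build:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("embedded-thrift-server"))
val hiveContext = new HiveContext(sc)
// Serves the tables registered in this context over JDBC/ODBC.
HiveThriftServer2.startWithContext(hiveContext)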
On
Thanks Marcelo,
Let me spin this towards a parallel trajectory then, as the title change
implies. I think I will further read some of the articles at
https://spark.apache.org/research.html but basically, I understand Spark
keeps the data in-memory, and only pulls from hdfs, or at most writes the
Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented. I'll
try it out when I have time. Thanks!
Hi,
Are there any Dockerfiles available which allow setting up a Docker Spark 1.1.0
cluster?
Thanks,
Josh
Hi,
here you can find some info regarding 1.0:
https://github.com/amplab/docker-scripts
Marek
2014-10-24 23:38 GMT+02:00 Josh J joshjd...@gmail.com:
Hi,
Are there any Dockerfiles available which allow setting up a Docker Spark 1.1.0
cluster?
Thanks,
Josh
Oh snap--first I've heard of this repo.
Marek,
We are having a discussion related to this on SPARK-3821
(https://issues.apache.org/jira/browse/SPARK-3821) that you may be interested in.
Nick
On Fri, Oct 24, 2014 at 5:50 PM, Marek Wiewiorka marek.wiewio...@gmail.com
wrote:
Hi,
here you can find
We don't have SVMWithLBFGS, but you can check out how we implement
LogisticRegressionWithLBFGS, and we also deal with some condition
number improving stuff in LogisticRegressionWithLBFGS which improves
the performance dramatically.
Sincerely,
DB Tsai
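For context, a minimal sketch of using the existing LBFGS-based logistic regression that the message refers to (the data path is a placeholder, and sc is assumed to be a live SparkContext):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
// Binary logistic regression trained with the LBFGS optimizer.
val model = new LogisticRegressionWithLBFGS().run(training)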
@dbtsai, for the condition number, what did you use? Diagonal preconditioning of
the inverse of the B matrix? But then the B matrix keeps changing... did you
condition it after every few iterations?
Will it be possible to put that code in Breeze, since it will be very useful
for conditioning other solvers as
Daniel,
Currently, having Tachyon will at least help on the input part in this case.
Haoyuan
On Fri, Oct 24, 2014 at 2:01 PM, Daniel Mahler dmah...@gmail.com wrote:
I am trying to convert some json logs to Parquet and save them on S3.
In principle this is just
import org.apache.spark._
Oh, we just train the model in the standardized space, which helps
the convergence of LBFGS. Then we convert the weights back to the original
space, so the whole thing is transparent to users.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
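As a hedged illustration of that standardize-then-convert-back idea using MLlib's StandardScaler (a sketch of the general technique, not the exact code inside LogisticRegressionWithLBFGS; the data path is a placeholder):

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
// Scale each feature to unit standard deviation (no centering, to stay sparse-friendly).
val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))
val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
// Train on `scaled`; the learned weights can then be mapped back to the
// original feature space by dividing each weight by that feature's standard deviation.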
Usually when the SparkContext throws an NPE it means that it has been shut
down due to some earlier failure.
On Wed, Oct 22, 2014 at 5:29 PM, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
I got java.lang.NullPointerException. Please help!
sqlContext.sql("select l_orderkey,
Yeah, column normalization. For some of the datasets, without doing
this, it will not converge.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Fri, Oct 24, 2014 at 3:46 PM, Debasish
Hi,
My Steps:
### HIVE
CREATE TABLE CUSTOMER (
  C_CUSTKEY BIGINT,
  C_NAME VARCHAR(25),
  C_ADDRESS VARCHAR(40),
  C_NATIONKEY BIGINT,
  C_PHONE VARCHAR(15),
  C_ACCTBAL DECIMAL,
  C_MKTSEGMENT VARCHAR(10),
  C_COMMENT VARCHAR(117)
) row format serde 'com.bizo.hive.serde.csv.CSVSerde';
Hi,
Added “l_linestatus” and it works, THANK YOU!!
sqlContext.sql("select l_linestatus, l_orderkey, l_linenumber, l_partkey,
l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by
L_LINESTATUS limit 10").collect().foreach(println);
14/10/25 07:03:24 INFO DAGScheduler: Stage 12
Thanks a lot. Now it is working properly.
On Sat, Oct 25, 2014 at 2:13 AM, Ankur Dave ankurd...@gmail.com wrote:
At 2014-10-23 09:48:55 +0530, Arpit Kumar arp8...@gmail.com wrote:
error: value partitionBy is not a member of
org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID,
Hi all,
I am using the GraphLoader class to load graphs from edge list files. But
then I need to change the storage level of the graph to something other
than MEMORY_ONLY.
val graph = GraphLoader.edgeListFile(sc, fname,
minEdgePartitions =
Does a Spark worker node need access to Hive's metastore if part of a job
contains Hive queries?
Thanks,
Ken