You can call coalesce() to merge the small splits into bigger ones.
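For example (the path and partition count are made up):

val small = sc.textFile("hdfs:///data/many-small-files/")
val merged = small.coalesce(16)  // merge into 16 bigger partitions, no shuffle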
Davies
On Fri, Oct 17, 2014 at 5:35 PM, Larry Liu larryli...@gmail.com wrote:
Is it possible to disable input split if input is already small?
Also - if you're doing a text file read you can pass the number of
resulting partitions as the second argument.
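For example (the path is made up):

val rdd = sc.textFile("hdfs:///data/input.txt", 8)  // ask for at least 8 partitions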
On Oct 17, 2014 9:05 PM, Larry Liu larryli...@gmail.com wrote:
Thanks, Andrew. What about reading out of local?
On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash and...@andrewash.com
The complete code is as follows:
JavaHiveContext ctx;
JavaSchemaRDD schemas = ctx.jsonRDD(arg0);
schemas.insertInto("test", true);
JavaSchemaRDD teenagers = ctx.hql("SELECT a, b FROM test");
List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
  public String call(Row row) { return "Name: " + row.getString(0); }
}).collect();
Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT
INTO ... VALUES ... syntax.
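A form Hive 0.12.0 does support is INSERT INTO TABLE ... SELECT. A sketch (the table names are hypothetical):

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.hql("INSERT INTO TABLE test SELECT a, b FROM staging")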
On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote:
Hi,
When trying to insert records into HIVE, I got error,
My Spark is 1.1.0 and Hive 0.12.0
Any idea what would be wrong?
Regards
Arthur
Hi Jenny, how did you configure the classpath and start the Thrift
server (YARN client/YARN cluster/standalone/...)?
On 10/18/14 4:14 AM, Jenny Zhao wrote:
Hi,
if the Spark Thrift JDBC server is started in non-secure mode, it is
working fine. With a secured mode, in the case of pluggable
In your JSON snippet, 111 and 222 are quoted, i.e. they are strings.
Thus they are automatically inferred as string rather than tinyint by
jsonRDD. Try this in the Spark shell:
val sparkContext = sc
import org.apache.spark.sql._
import sparkContext._
val sqlContext = new SQLContext(sparkContext)
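Continuing the snippet, a quick demonstration (the sample JSON is made up):

val json = sqlContext.jsonRDD(sc.parallelize("""{"a": "111", "b": 222}""" :: Nil))
json.printSchema()
// root
//  |-- a: string (nullable = true)   <- quoted, so inferred as string
//  |-- b: integer (nullable = true)  <- unquoted, inferred as numeric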
Hi,
Is anyone working on a project to integrate the Oryx model-serving layer
with Spark? Models would be built using either streaming data or batch data
in HDFS and cross-validated with MLlib APIs, but the model-serving layer
would give API endpoints like Oryx,
and maybe read the models from
Does it retain the order if it's pulling from the HDFS blocks? Meaning,
if file1 = a, b, c partitions in order,
and I convert to a 2-partition read, will it map to ab, c or a, bc, or could
it also be a, cb?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Hi,
I have a program for single-computer execution (in Python) and have also
implemented the same for Spark. This program basically only reads .json files,
takes one field from them, and saves it back. Using Spark my program runs
approximately 100 times slower on 1 master and 1 slave. So I
Hi,
I have the following questions:
1. When I write a Spark script, how do I know what part runs on the driver side
and what runs on the worker side?
So let's say I write code to read a plain text file.
Will it run on the driver side only, or the worker side only, or on
both sides?
How many files do you have and how big is each JSON object?
Spark works better with a few big files vs many smaller ones. So you could try
cat'ing your files together and rerunning the same experiment.
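As an in-Spark alternative to cat'ing, you could read the whole directory as one RDD and coalesce it so each task covers several files' worth of data. A sketch (the path and count are made up):

val all = sc.textFile("hdfs:///data/json-dir/")
val fewer = all.coalesce(32)  // fewer, bigger partitions to cut per-task overhead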
- Evan
On Oct 18, 2014, at 12:07 PM, jan.zi...@centrum.cz wrote:
Hi Lang,
If the Linux kernel on those machines recognizes all the cores, then Spark
will use them all naturally with no extra work. Are you seeing otherwise?
Andrew
On Oct 9, 2014 2:00 PM, Lang Yu lysubscr...@gmail.com wrote:
Hi,
Currently all the workloads are run on CPUs. Is it possible that
How many CPUs are on the slave?
Because of the overhead between the JVM and Python, a single task will be
slower than your local Python script, but it's very easy to scale to
many CPUs.
Even with one CPU, it's not common for PySpark to be 100 times slower. If
you have many small files, each file will be processed
toBreeze is private within Spark; it should not be accessible to users. If you
want to make a Breeze vector from an MLlib one, it's pretty straightforward,
and you can make your own utility function for it.
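A user-side utility along these lines should work (a sketch, not Spark's internal code):

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

def toBreeze(v: Vector): BV[Double] = v match {
  case dv: DenseVector => new BDV[Double](dv.values)
  case sv: SparseVector => new BSV[Double](sv.indices, sv.values, sv.size)
}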
Matei
On Oct 17, 2014, at 5:09 PM, Sean Owen so...@cloudera.com wrote:
Yes, I
When doing a groupBy on big data, maybe 500 GB, some partition tasks succeed
and some fail with FetchFailed errors. Spark retries the previous stage, but it
always fails.
6 computers, 384 GB each
Workers: 40 GB * 7 on each computer
Can anyone tell me why the fetch failed?
The minPartitions argument of textFile/hadoopFile cannot decrease the
number of splits past the physical number of blocks/files. So if you have 3
HDFS blocks, asking for 2 minPartitions will still give you 3 partitions
(hence the "min"). It can, however, split a file with fewer HDFS blocks into
more partitions than blocks.
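To illustrate, assuming a file that happens to span 3 HDFS blocks:

val rdd = sc.textFile("hdfs:///data/file.txt", 2)    // minPartitions = 2
println(rdd.partitions.length)                       // still 3: block count is a floor
val more = sc.textFile("hdfs:///data/file.txt", 12)  // asking for more than 3 works
println(more.partitions.length)                      // 12 or more
val merged = rdd.coalesce(2)                         // coalesce() can merge below the floor
println(merged.partitions.length)                    // 2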
Hi,
I have a bunch of XML files and I want to run Spark SQL on them. Is there a
recommended approach? I am thinking of either converting the XML to JSON and
then using jsonRDD.
Please let me know your thoughts.
Regards,
Gaurav
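A sketch of the XML-to-JSON route (the element names and input path are hypothetical):

import scala.xml.XML
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// one small XML document per file; extract two fields into JSON strings
val jsonStrings = sc.wholeTextFiles("hdfs:///data/xml/").map { case (_, content) =>
  val doc = XML.loadString(content)
  s"""{"name": "${(doc \ "name").text}", "age": ${(doc \ "age").text}}"""
}
val records = sqlContext.jsonRDD(jsonStrings)
records.registerTempTable("records")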
Oops yes it is. Been working inside the Spark packages too long. Ignore
that comment.
On Oct 19, 2014 1:42 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
toBreeze is private within Spark, it should not be accessible to users. If
you want to make a Breeze vector from an MLlib one, it's pretty
After successful events in the past two years, the Spark Summit conference has
expanded for 2015, offering both an event in New York on March 18-19 and one in
San Francisco on June 15-17. The conference is a great chance to meet people
from throughout the Spark community and see the latest
Side note: I thought bzip2 was splittable. Perhaps you meant gzip?
On Saturday, October 18, 2014, Aaron Davidson ilike...@gmail.com wrote:
The minPartitions argument of textFile/hadoopFile cannot decrease the
number of splits past the physical number of blocks/files. So if you have 3
HDFS blocks,