Re: How to disable input split

2014-10-18 Thread Davies Liu
You can call coalesce() to merge the small splits into bigger ones. Davies On Fri, Oct 17, 2014 at 5:35 PM, Larry Liu larryli...@gmail.com wrote: Is it possible to disable input split if input is already small?
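Davies's suggestion could be sketched in PySpark roughly as follows (a minimal sketch, not from the thread; the input path and partition count are hypothetical, and it assumes a local Spark installation):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "coalesce-sketch")

# Many small input files typically yield one partition per input split.
rdd = sc.textFile("hdfs:///data/small-files/*")  # hypothetical path

# coalesce() merges the small splits into fewer, bigger partitions
# without a shuffle (pass shuffle=True to force a full repartition).
merged = rdd.coalesce(4)
print(merged.getNumPartitions())
```

Note that coalesce() only reduces the partition count; it does not change how the input was split on read.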

Re: input split size

2014-10-18 Thread Ilya Ganelin
Also - if you're doing a text file read you can pass the number of resulting partitions as the second argument. On Oct 17, 2014 9:05 PM, Larry Liu larryli...@gmail.com wrote: Thanks, Andrew. What about reading out of local? On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash and...@andrewash.com
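Ilya's point, sketched below (hedged; the same second argument works for a local path as for HDFS, and the path here is hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "min-partitions-sketch")

# The second argument to textFile() is minPartitions: a lower bound on
# the number of resulting partitions, not an exact count.
rdd = sc.textFile("file:///tmp/input.txt", 8)  # hypothetical local path
print(rdd.getNumPartitions())
```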

a hivectx insertinto issue - can the insertInto function be applied to a hive table

2014-10-18 Thread valgrind_girl
The complete code is as follows: JavaHiveContext ctx; JavaSchemaRDD schemas = ctx.jsonRDD(arg0); schemas.insertInto("test", true); JavaSchemaRDD teeagers = ctx.hql("SELECT a, b FROM test"); List&lt;String&gt; teeagerNames1 = teeagers.map(new Function&lt;Row, String&gt;()

Re: Spark/HIVE Insert Into values Error

2014-10-18 Thread Cheng Lian
Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT INTO ... VALUES ... syntax. On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote: Hi, When trying to insert records into HIVE, I got error, My Spark is 1.1.0 and Hive 0.12.0 Any idea what would be wrong? Regards Arthur

Re: Unable to connect to Spark thrift JDBC server with pluggable authentication

2014-10-18 Thread Cheng Lian
Hi Jenny, how did you configure the classpath and start the Thrift server (YARN client/YARN cluster/standalone/...)? On 10/18/14 4:14 AM, Jenny Zhao wrote: Hi, if Spark thrift JDBC server is started with non-secure mode, it is working fine. with a secured mode in case of pluggable

Re: a hivectx insertinto issue - can the insertInto function be applied to a hive table

2014-10-18 Thread Cheng Lian
In your JSON snippet, 111 and 222 are quoted, i.e. they are strings. Thus they are automatically inferred as string rather than tinyint by jsonRDD. Try this in Spark shell: val sparkContext = sc import org.apache.spark.sql._ import sparkContext._ val sqlContext = new
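The quoting distinction Cheng describes can be seen with plain Python's json module as well (a minimal illustration independent of Spark; jsonRDD's schema inference follows the same JSON typing rules):

```python
import json

# "111" is quoted, so it parses as a string; 222 is unquoted, so it
# parses as a number. jsonRDD infers column types from exactly this.
record = json.loads('{"a": "111", "b": 222}')

print(type(record["a"]))  # <class 'str'>
print(type(record["b"]))  # <class 'int'>
```

To get a numeric column, the JSON values themselves must be unquoted.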

Fwd: Oryx + Spark mllib

2014-10-18 Thread Debasish Das
Hi, Is anyone working on a project integrating the Oryx model-serving layer with Spark? Models will be built using either streaming data or batch data in HDFS and cross-validated with MLlib APIs, but the model-serving layer will give API endpoints like Oryx and read the models may be from

Re: input split size

2014-10-18 Thread Mayur Rustagi
Does it retain the order if it's pulling from the HDFS blocks? Meaning, if file1 = a, b, c partitions in order, and I convert to a 2-partition read, will it map to ab, c or a, bc, or could it also be a, cb? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Spark speed performance

2014-10-18 Thread jan.zikes
Hi, I have a program that I wrote for single-computer execution (in Python) and also implemented the same for Spark. This program basically only reads .json, takes one field from it, and saves it back. Using Spark, my program runs approximately 100 times slower on 1 master and 1 slave. So I

What executes on worker and what executes on driver side

2014-10-18 Thread Saurabh Wadhawan
Hi, I have the following questions: 1. When I write a Spark script, how do I know what part runs on the driver side and what runs on the worker side? So let's say I write code to read a plain text file. Will it run on the driver side only, the worker side only, or on both sides

Re: Spark speed performance

2014-10-18 Thread Evan Sparks
How many files do you have and how big is each JSON object? Spark works better with a few big files vs many smaller ones. So you could try cat'ing your files together and rerunning the same experiment. - Evan On Oct 18, 2014, at 12:07 PM, jan.zi...@centrum.cz jan.zi...@centrum.cz wrote:
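Evan's suggestion of concatenating the small files can be done with cat, or portably in plain Python (a minimal sketch; the file-name patterns are hypothetical, and it assumes newline-delimited JSON so records stay intact across file boundaries):

```python
import glob
import shutil

# Concatenate many small JSON-lines files into one big file, so Spark
# sees fewer, larger inputs instead of one tiny split per file.
with open("combined.json", "wb") as out:
    for path in sorted(glob.glob("part-*.json")):  # hypothetical names
        with open(path, "rb") as part:
            shutil.copyfileobj(part, out)
```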

Re: Could Spark make use of Intel Xeon Phi?

2014-10-18 Thread Andrew Ash
Hi Lang, If the Linux kernel on those machines recognizes all the cores, then Spark will use them all naturally with no extra work. Are you seeing otherwise? Andrew On Oct 9, 2014 2:00 PM, Lang Yu lysubscr...@gmail.com wrote: Hi, Currently all the workloads are run on CPUs. Is it possible that

Re: Spark speed performance

2014-10-18 Thread Davies Liu
How many CPUs on the slave? Because of the overhead between the JVM and Python, a single task will be slower than your local Python script, but it's very easy to scale to many CPUs. Even with one CPU, it's not common for PySpark to be 100 times slower. You have many small files; each file will be processed

Re: mllib.linalg.Vectors vs Breeze?

2014-10-18 Thread Matei Zaharia
toBreeze is private within Spark, it should not be accessible to users. If you want to make a Breeze vector from an MLlib one, it's pretty straightforward, and you can make your own utility function for it. Matei On Oct 17, 2014, at 5:09 PM, Sean Owen so...@cloudera.com wrote: Yes, I

why fetch failed

2014-10-18 Thread marylucy
When doing groupBy on big data (maybe 500g), some partition tasks succeed and some fail with a fetch failed error. Spark retries the previous stage, but it always fails. 6 computers: 384g. Worker: 40g*7 for one computer. Can anyone tell me why the fetch failed?

Re: input split size

2014-10-18 Thread Aaron Davidson
The minPartitions argument of textFile/hadoopFile cannot decrease the number of splits past the physical number of blocks/files. So if you have 3 HDFS blocks, asking for 2 minPartitions will still give you 3 partitions (hence the min). It can, however, convert a file with fewer HDFS blocks into

Spark SQL on XML files

2014-10-18 Thread gtinside
Hi, I have a bunch of XML files and I want to run Spark SQL on them. Is there a recommended approach? I am thinking of converting the XML to JSON and then using jsonRDD. Please let me know your thoughts. Regards, Gaurav
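The XML-to-JSON conversion Gaurav describes could be sketched with Python's standard library before handing the result to jsonRDD (the element and field names here are hypothetical, and it assumes flat, record-per-element XML):

```python
import json
import xml.etree.ElementTree as ET

# Convert one flat XML record into a JSON line that jsonRDD can ingest.
xml_record = "<trade><symbol>ABC</symbol><price>10.5</price></trade>"

root = ET.fromstring(xml_record)
as_dict = {child.tag: child.text for child in root}
json_line = json.dumps(as_dict)

print(json_line)
```

Note that all values come out as JSON strings this way; numeric fields would need an explicit cast before or after schema inference.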

Re: mllib.linalg.Vectors vs Breeze?

2014-10-18 Thread Sean Owen
Oops yes it is. Been working inside the Spark packages too long. Ignore that comment. On Oct 19, 2014 1:42 AM, Matei Zaharia matei.zaha...@gmail.com wrote: toBreeze is private within Spark, it should not be accessible to users. If you want to make a Breeze vector from an MLlib one, it's pretty

Submissions open for Spark Summit East 2015

2014-10-18 Thread Matei Zaharia
After successful events in the past two years, the Spark Summit conference has expanded for 2015, offering both an event in New York on March 18-19 and one in San Francisco on June 15-17. The conference is a great chance to meet people from throughout the Spark community and see the latest

Re: input split size

2014-10-18 Thread Nicholas Chammas
Side note: I thought bzip2 was splittable. Perhaps you meant gzip? On Saturday, October 18, 2014, Aaron Davidson ilike...@gmail.com wrote: The minPartitions argument of textFile/hadoopFile cannot decrease the number of splits past the physical number of blocks/files. So if you have 3 HDFS blocks,