Fwd: Spark partition size tuning

2016-01-25 Thread Jia Zou
Dear all, First, an update: the local file system data partition size can be tuned by: sc.hadoopConfiguration().setLong("fs.local.block.size", blocksize). However, I also need to tune the Spark data partition size for input data that is stored in Tachyon (default is 512MB), but the above method can't
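
A minimal sketch of the local file system tuning described above (Spark 1.x, Scala); the input path is a placeholder, and this does not address the Tachyon block size question:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("block-size-tuning"))
    // Hadoop input formats split local files on fs.local.block.size,
    // so a larger value yields fewer, larger partitions.
    sc.hadoopConfiguration.setLong("fs.local.block.size", 128L * 1024 * 1024) // 128 MB
    val rdd = sc.textFile("file:///tmp/input.txt")
    println(rdd.partitions.length)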

spark-sql[1.4.0] not compatible hive sql when using in with date_sub or regexp_replace

2016-01-25 Thread our...@cnsuning.com
Hi all, when migrating Hive SQL to Spark SQL I encountered an incompatibility problem. Please give me some suggestions. The Hive table description and data format are as follows: use spark; drop table spark.test_or1; CREATE TABLE `spark.test_or1`( `statis_date` string, `lbl_nm` string)

multi-threaded Spark jobs

2016-01-25 Thread Elango Cheran
Hi everyone, I've gone through the effort of figuring out how to modify a Spark job to have an operation become multi-threaded inside an executor. I've written up an explanation of what worked, what didn't work, and why:

Re: multi-threaded Spark jobs

2016-01-25 Thread Igor Berman
IMHO, you are making a mistake. Spark manages tasks and cores internally. When you open new threads inside an executor, you are "over-provisioning" the executor (e.g. tasks on other cores will be preempted). On 26 January 2016 at 07:59, Elango Cheran wrote: > Hi everyone, >

hivethriftserver2 problems on upgrade to 1.6.0

2016-01-25 Thread james.gre...@baesystems.com
On upgrading from 1.5.0 to 1.6.0 I have a problem with HiveThriftServer2. I have this code: val hiveContext = new HiveContext(SparkContext.getOrCreate(conf)); val thing = hiveContext.read.parquet("hdfs://dkclusterm1.imp.net:8020/user/jegreen1/ex208") thing.registerTempTable("thing")

Re: How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Jeff Zhang
It's very straightforward; please refer to the documentation here: http://spark.apache.org/docs/latest/ml-features.html#bucketizer On Mon, Jan 25, 2016 at 10:09 PM, Eli Super wrote: > Thanks Joshua , > > I can't understand what algorithm behind Bucketizer , how discretization >

[Spark] Reading avro file in Spark 1.3.0

2016-01-25 Thread diplomatic Guru
Hello guys, I've been trying to read an Avro file using Spark's DataFrame API, but it's throwing this error: java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.read()Lorg/apache/spark/sql/DataFrameReader; This is what I've done so far: I've added the dependency to pom.xml:

Re: Launching EC2 instances with Spark compiled for Scala 2.11

2016-01-25 Thread Darren Govoni
Why not deploy it. Then build a custom distribution with Scala 2.11 and just overlay it. Sent from my Verizon Wireless 4G LTE smartphone Original message From: Nuno Santos Date: 01/25/2016 7:38 AM (GMT-05:00) To: user@spark.apache.org Subject:

Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Deenar Toraskar
Hi, I am using a shared SparkContext for all of my Spark jobs. Some of the jobs use HiveContext, but there isn't a getOrCreate method on HiveContext that would allow reuse of an existing HiveContext; such a method exists on SQLContext only (def getOrCreate(sparkContext: SparkContext): SQLContext).
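
A minimal sketch of one possible workaround, assuming Spark 1.x; HiveContextHolder is a hypothetical helper, not a Spark API:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    // Double-checked locking around a single shared HiveContext.
    object HiveContextHolder {
      @volatile private var instance: HiveContext = _
      def getOrCreate(sc: SparkContext): HiveContext = {
        if (instance == null) synchronized {
          if (instance == null) instance = new HiveContext(sc)
        }
        instance
      }
    }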

Re: Launching EC2 instances with Spark compiled for Scala 2.11

2016-01-25 Thread Nuno Santos
Hello, any updates on this question? I'm also very interested in a solution, as I'm trying to use Spark on EC2 but need Scala 2.11 support. The scripts in the ec2 directory of the Spark distribution install Scala 2.10 by default, and I can't see any obvious option to change to Scala 2.11.

Re: How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Eli Super
Thanks Joshua, I can't understand what algorithm is behind Bucketizer, or how the discretization is done. Best Regards On Mon, Jan 25, 2016 at 3:40 PM, Joshua TAYLOR wrote: > It sounds like you may want the Bucketizer in SparkML. The overview docs > [1] include, "Bucketizer

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Ted Yu
Yes, thread dump plus log would be helpful for debugging. Thanks > On Jan 25, 2016, at 5:59 AM, Sanders, Isaac B > wrote: > > Is the thread dump the stack trace you are talking about? If so, I will see > if I can capture the few different stages I have seen it in.

Re: How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Joshua TAYLOR
It sounds like you may want the Bucketizer in SparkML. The overview docs [1] include, "Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users." [1]: http://spark.apache.org/docs/latest/ml-features.html#bucketizer On Mon,
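
A minimal sketch of Bucketizer following the linked docs, assuming a SQLContext `sqlContext` is in scope; note the splits are user-supplied cut points rather than learned, so this is neither equal-frequency nor k-means binning:

    import org.apache.spark.ml.feature.Bucketizer

    val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
    val data = Array(-0.9, -0.3, 0.0, 0.2, 0.8)
    val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
    val bucketizer = new Bucketizer()
      .setInputCol("features")
      .setOutputCol("bucketed")
      .setSplits(splits)
    bucketizer.transform(df).show() // each value mapped to its bucket index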

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Darren Govoni
Yeah. I have screenshots and stack traces. I will post them to the ticket. Nothing informative. I should also mention I'm using pyspark but I think the deadlock is inside the Java scheduler code. Sent from my Verizon Wireless 4G LTE smartphone Original message From:

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Ted Yu
Opening a JIRA is fine. See if you can capture a stack trace during the hung stage and attach it to the JIRA so that we have more clues. Thanks > On Jan 25, 2016, at 4:25 AM, Darren Govoni wrote: > > Probably we should open a ticket for this. > There's definitely a deadlock

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Sanders, Isaac B
Is the thread dump the stack trace you are talking about? If so, I will see if I can capture the few different stages I have seen it in. Thanks for the help, I was able to do it for 0.1% of my data. I will create the JIRA. Thanks, Isaac On Jan 25, 2016, at 8:51 AM, Ted Yu

Running kafka consumer in local mode - error - connection timed out

2016-01-25 Thread Supreeth
We are running a Kafka Consumer in local mode using Spark Streaming, KafkaUtils.createDirectStream. The job runs as expected; however, once in a very long while, I see the following exception. I wanted to check if others have faced a similar issue, and what the right timeout parameters are to

Re: Spark integration with HCatalog (specifically regarding partitions)

2016-01-25 Thread Elliot West
Thanks for your response Jorge and apologies for my delay in replying. I took your advice with case 5 and declared the column names explicitly instead of the wildcard. This did the trick and I can now add partitions to an existing table. I also tried removing the 'partitionBy("id")' call as

Re: Running kafka consumer in local mode - error - connection timed out

2016-01-25 Thread Cody Koeninger
It should be socket.timeout.ms in the map of Kafka config parameters. The lack of retry is probably due to the differences between running Spark in local mode vs standalone / Mesos / YARN. On Mon, Jan 25, 2016 at 1:19 PM, Supreeth wrote: > We are running a Kafka Consumer
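
A minimal sketch of where that setting goes, assuming the Spark 1.x direct stream API and a StreamingContext `ssc`; broker address, topic, and timeout value are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "socket.timeout.ms" -> "60000") // Kafka consumer socket timeout
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))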

Re: bug for large textfiles on windows

2016-01-25 Thread Josh Rosen
Hi Christopher, What would be super helpful here is a standalone reproduction. Ideally this would be a single Scala file or set of commands that I can run in `spark-shell` in order to reproduce this. Ideally, this code would generate a giant file, then try to read it in a way that demonstrates

Re: How to setup a long running spark streaming job with continuous window refresh

2016-01-25 Thread Tathagata Das
You can use a 1 minute tumbling window dstream.window(Minutes(1), Minutes(1)).foreachRDD { rdd => // calculate stats per key } On Thu, Jan 21, 2016 at 4:59 AM, Santoshakhilesh < santosh.akhil...@huawei.com> wrote: > Hi, > > I have following scenario in my project; > > 1.I will continue
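
A slightly fuller sketch of the suggestion, assuming `dstream` is a DStream of (key, value) pairs defined elsewhere in the job:

    import org.apache.spark.streaming.Minutes

    // Non-overlapping 1-minute windows: window length == slide interval.
    dstream.window(Minutes(1), Minutes(1)).foreachRDD { rdd =>
      val statsPerKey = rdd.countByKey() // or any other per-key aggregation
      statsPerKey.foreach(println)
    }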

Re: how to build spark with out hive

2016-01-25 Thread Ted Yu
Spark 1.5.2 depends on slf4j 1.7.10. It looks like there was another version of slf4j on the classpath. FYI On Mon, Jan 25, 2016 at 12:19 AM, kevin wrote: > HI,all > I need to test hive on spark ,to use spark as the hive's execute > engine. > I download the spark

Re: NA value handling in sparkR

2016-01-25 Thread Deborah Siegel
Maybe not ideal, but since read.df infers all columns of a CSV containing "NA" as string type, one could filter them rather than using dropna(): filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA") head(filtered_aq) Perhaps it would be better to have an option for

bug for large textfiles on windows

2016-01-25 Thread Christopher Bourez
Dears, I would like to reopen a case for a potential bug (the current status is resolved, but it seems it is not): https://issues.apache.org/jira/browse/SPARK-12261 I believe there is something wrong with the memory management under Windows. It has

Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Ashish Soni
Hi All, What is the best way to tell a Spark Streaming job the number of partitions for a given topic? Should that be provided as a parameter or command line argument, or should we connect to Kafka in the driver program and query it? Map fromOffsets = new

a question about web ui log

2016-01-25 Thread Philip Lee
Hello, a question about the web UI log. I could see the web interface log after forwarding the port on my cluster to my local machine and clicking a completed application, but when I clicked "application detail UI" [inline image] it happened to me. I do not know why. I also checked the specific log

Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Ashish Soni
Correct. What I am trying to achieve is: before the streaming job starts, query the topic metadata from Kafka, determine all the partitions, and provide those to the direct API. So my question is, should I consider passing all the partitions from the command line, or query Kafka to find and provide them,

Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Gerard Maas
What are you trying to achieve? Looks like you want to provide offsets but you're not managing them, and I'm assuming you're using the direct stream approach. In that case, use the simpler constructor that takes the kafka config and the topics. Let it figure out the offsets (it will contact

Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Gerard Maas
That's precisely what this constructor does: KafkaUtils.createDirectStream[...](ssc, kafkaConfig, topics) Is there a reason to do that yourself? In that case, look at how it's done in Spark Streaming for inspiration:

Re: [Spark] Reading avro file in Spark 1.3.0

2016-01-25 Thread Kevin Mellott
I think that you may be looking at documentation pertaining to the more recent versions of Spark. Try looking at the examples linked below, which applies to the Spark 1.3 version. There aren't many Java examples, but the code should be very similar to the Scala ones (i.e. using "load" instead of
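
A minimal sketch of the 1.3-style call, assuming the spark-avro package is on the classpath; the path is a placeholder:

    // Spark 1.3 uses SQLContext.load(path, source) instead of the read() API.
    val df = sqlContext.load("/path/to/file.avro", "com.databricks.spark.avro")
    df.printSchema()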

Re: a question about web ui log

2016-01-25 Thread Philip Lee
As I mentioned before, I am trying to see the Spark log on a cluster via an SSH tunnel. 1) The error on the application details UI is probably from monitoring port 4044. The web UI port is 8088, right? So how could I see the job web UI view and the application details UI view in the web UI on my local machine?

Re: Spark master takes more time with local[8] than local[1]

2016-01-25 Thread nsalian
Hi, thanks for the question. Is it possible for you to elaborate on your application? The flow of the application will help in understanding what could potentially cause things to slow down. Do the logs give you any idea of what goes on? Have you had a chance to look? Thank you. - Neelesh S.

Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
Hi, Perhaps I should write a blog about why Spark is focusing more on making Spark jobs easier to write while hiding the underlying performance optimization details from seasoned Spark users. It's one thing to provide such an abstract framework that does optimization for you so you don't have to

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Mark Hamstra
What do you think is preventing you from optimizing your own RDD-level transformations and actions? AFAIK, nothing that has been added in Catalyst precludes you from doing that. The fact of the matter is, though, that there is less type and semantic information available to Spark from the raw

Re: Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Ted Yu
Have you noticed the following method of HiveContext? "Returns a new HiveContext as new session, which will have separated SQLConf, UDF/UDAF, temporary tables and SessionState, but sharing the same CacheManager, IsolatedClientLoader and Hive client (both of execution and metadata)"
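
A minimal sketch of that approach, assuming a shared Spark 1.6 HiveContext created once at startup:

    import org.apache.spark.sql.hive.HiveContext

    val sharedContext = new HiveContext(sc)
    // Each job gets isolated temp tables and SQLConf, but shares the
    // CacheManager and Hive client with the parent context.
    val jobContext = sharedContext.newSession()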

Re: bug for large textfiles on windows

2016-01-25 Thread Christopher Bourez
The same problem occurs on my desktop at work. What's great with AWS Workspace is that you can easily reproduce it. I created the test file with these commands: for i in {0..30}; do VALUE="$RANDOM"; for j in {0..6}; do VALUE="$VALUE;$RANDOM"; done; echo $VALUE >> test.csv; done

Re: Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Deenar Toraskar
On 25 January 2016 at 21:09, Deenar Toraskar <deenar.toras...@thinkreactive.co.uk> wrote: > No I hadn't. This is useful, but in some cases we do want to share the > same temporary tables between jobs so really wanted a getOrCreate > equivalent on HiveContext. > > Deenar > > > > On 25 January

Re: Running kafka consumer in local mode - error - connection timed out

2016-01-25 Thread Supreeth
Hmm, thanks for the response. The current value I have for socket.timeout.ms is 12. I am not sure if this needs a higher value; there's not much in the logs. The retry aspect makes sense, I can work around the same. -S On Mon, Jan 25, 2016 at 11:51 AM, Cody Koeninger wrote:

Re: Spark RDD DAG behaviour understanding in case of checkpointing

2016-01-25 Thread Tathagata Das
First of all, if you are running batches of 15 minutes and you don't need second-level latencies, it might just be easier to run batch jobs in a for loop; you will have greater control over what is going on. And if you are using reduceByKeyAndWindow without the inverseReduceFunction, then Spark

Re: Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-25 Thread Michael Armbrust
Looks like you found a bug. I've filed them here: SPARK-12987 (Drop fails when columns contain dots) and SPARK-12988 (Can't drop columns that contain dots). On Fri, Jan 22, 2016 at 3:18 PM, Joshua

Datasets and columns

2016-01-25 Thread Steve Lewis
assume I have the following code: SparkConf sparkConf = new SparkConf(); JavaSparkContext sqlCtx = new JavaSparkContext(sparkConf); JavaRDD rddMyType = generateRDD(); // some code Encoder evidence = Encoders.kryo(MyType.class); Dataset datasetMyType = sqlCtx.createDataset(rddMyType.rdd(),

Re: streaming textFileStream problem - got only ONE line

2016-01-25 Thread Shixiong(Ryan) Zhu
Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or write into it directly? `textFileStream` requires that files must be written to the monitored directory by "moving" them from another location within the same file system. On Mon, Jan 25, 2016 at 6:30 AM, patcharee

Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-25 Thread Joshua TAYLOR
(Apologies if this has arrived more than once. I've subscribed to the list, and tried posting via email with no success. This an intentional repost to see if things are going through yet.) I've been having lots of trouble with DataFrames whose columns have dots in their names today. I know

Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Jia Zou
I configured HDFS to cache a file in HDFS's cache, like the following: hdfs cacheadmin -addPool hibench hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench But I didn't see much performance impact, no matter how I configure dfs.datanode.max.locked.memory. Is it possible that

Re: mapWithState and context start when checkpoint exists

2016-01-25 Thread Shixiong(Ryan) Zhu
Hey Andrey, `ConstantInputDStream` doesn't support checkpoint as it contains an RDD field. It cannot resume from checkpoints. On Mon, Jan 25, 2016 at 1:12 PM, Andrey Yegorov wrote: > Hi, > > I am new to spark (and scala) and hope someone can help me with the issue > I

Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Have you read this thread ? http://search-hadoop.com/m/uOzYttXZcg1M6oKf2/HDFS+cache=RE+hadoop+hdfs+cache+question+do+client+processes+share+cache+ Cheers On Mon, Jan 25, 2016 at 1:23 PM, Jia Zou wrote: > I configured HDFS to cache file in HDFS's cache, like following:

Re: bug for large textfiles on windows

2016-01-25 Thread Christopher Bourez
Josh, thanks a lot! You can download a video I created: https://s3-eu-west-1.amazonaws.com/christopherbourez/public/video.mov I created a sample file of 13 MB as explained: https://s3-eu-west-1.amazonaws.com/christopherbourez/public/test.csv Here are the commands I ran: I created an AWS

Re: bug for large textfiles on windows

2016-01-25 Thread Christopher Bourez
Here is a pic of the memory. If I put --conf spark.driver.memory=3g, it increases the displayed memory, but the problem remains... for a file that is only 13 MB. Christopher Bourez 06 17 17 50 60 On Mon, Jan 25, 2016 at 10:06 PM, Christopher Bourez < christopher.bou...@gmail.com> wrote: > The same

Re: Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-25 Thread Joshua TAYLOR
Thanks Michael, hopefully those will get some attention for a not-too-distant release. Do you think that this is related to, or separate from, a similar issue [1] that I filed a bit earlier, regarding the way that StringIndexer (and perhaps other ML components) handles some of these columns?

Re: Datasets and columns

2016-01-25 Thread Michael Armbrust
The encoder is responsible for mapping your class onto some set of columns. Try running: datasetMyType.printSchema() On Mon, Jan 25, 2016 at 1:16 PM, Steve Lewis wrote: > assume I have the following code > > SparkConf sparkConf = new SparkConf(); > > JavaSparkContext

mapWithState and context start when checkpoint exists

2016-01-25 Thread Andrey Yegorov
Hi, I am new to spark (and scala) and hope someone can help me with the issue I got stuck on in my experiments/learning. mapWithState from spark 1.6 seems to be a great way for the task I want to implement with spark but unfortunately I am getting error "RDD transformations and actions can only

Re: mapWithState and context start when checkpoint exists

2016-01-25 Thread Andrey Yegorov
Thank you! What would be the best alternative to simulate a stream for testing purposes, e.g. from a sequence or a text file? In production I'll use Kafka as a source, but locally I wanted to mock it. Worst case, I'll have to set up / tear down a Kafka cluster in tests, but I think having a mock will
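
A minimal sketch of one common way to mock a source locally, assuming a StreamingContext `ssc`; note that queueStream, like ConstantInputDStream, does not support checkpoint-based recovery:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    val queue = mutable.Queue[RDD[String]]()
    val mockStream = ssc.queueStream(queue) // one queued RDD is consumed per batch
    queue += ssc.sparkContext.makeRDD(Seq("event-1", "event-2"))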

Re: Datasets and columns

2016-01-25 Thread Michael Armbrust
There is no public API for custom encoders yet, but since your class looks like a bean you should be able to use the `bean` method instead of `kryo`. This will expose the actual columns. On Mon, Jan 25, 2016 at 2:04 PM, Steve Lewis wrote: > Ok when I look at the schema it
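
A minimal sketch of the bean-based alternative in Scala (Spark 1.6), assuming MyType has bean-style getters and setters and rddMyType is an RDD[MyType]:

    import org.apache.spark.sql.Encoders

    val ds = sqlContext.createDataset(rddMyType)(Encoders.bean(classOf[MyType]))
    ds.printSchema() // one column per bean property, not a single kryo binary blob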

Re: cast column string -> timestamp in Parquet file

2016-01-25 Thread Cheng Lian
The following snippet may help: sqlContext.read.parquet(path).withColumn("col_ts", $"col".cast(TimestampType)).drop("col") Cheng On 1/21/16 6:58 AM, Muthu Jayakumar wrote: DataFrame and udf. This may be more performant than doing an RDD transformation as you'll only transform just the
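
A slightly fuller version of the snippet, assuming `path` points at a Parquet file with a string column named `col`:

    import org.apache.spark.sql.types.TimestampType
    import sqlContext.implicits._ // for the $"..." column syntax

    val df = sqlContext.read.parquet(path)
      .withColumn("col_ts", $"col".cast(TimestampType))
      .drop("col")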

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-25 Thread Ram Sriharsha
Hi David What happens if you provide the class labels via metadata instead of letting OneVsRest determine the labels? Ram On Mon, Jan 25, 2016 at 3:06 PM, David Brooks wrote: > Hi, > > I've run into an exception using MLlib OneVsRest with logistic regression > (v1.6.0, but

Standalone scheduler issue - one job occupies the whole cluster somehow

2016-01-25 Thread Mikhail Strebkov
Hi all, Recently we started having issues with one of our background processing scripts which we run on Spark. The cluster runs only two jobs. One job runs for days, and another usually takes a couple of hours. Both jobs have a cron schedule. The cluster is small, just 2 slaves, 24 cores, 25.4

MLlib OneVsRest causing intermittent exceptions

2016-01-25 Thread David Brooks
Hi, I've run into an exception using MLlib OneVsRest with logistic regression (v1.6.0, but also in previous versions). The issue is intermittent. When running multiclass classification with K-fold cross validation, there are scenarios where the split does not contain instances for every target

Re: Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-25 Thread Shixiong(Ryan) Zhu
You need to define a create function and use StreamingContext.getOrCreate. See the example here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#how-to-configure-checkpointing On Thu, Jan 21, 2016 at 3:32 AM, Patrick McGloin wrote: > Hi all, > >
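
A minimal sketch of the pattern from the linked guide; the checkpoint directory, app name, and batch interval are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/checkpoints"

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("wal-app"), Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define all streams and output operations here, before returning
      ssc
    }

    // Recovers from the checkpoint if one exists, otherwise calls createContext().
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()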

Re: Datasets and columns

2016-01-25 Thread Steve Lewis
Ok, when I look at the schema it looks like Kryo makes one column. Is there a way to do a custom encoder with my own columns? On Jan 25, 2016 1:30 PM, "Michael Armbrust" wrote: > The encoder is responsible for mapping your class onto some set of > columns. Try running:

understanding iterative algorithms in Spark

2016-01-25 Thread Raghava
Hello All, I am new to Spark and I am trying to understand how iterative application of operations is handled in Spark. Consider the following program in Scala. var u = sc.textFile(args(0)+"s1.txt").map(line => { line.split("\\|") match { case Array(x,y) => (y.toInt,x.toInt)}})

Generic Dataset Aggregator

2016-01-25 Thread Deenar Toraskar
Hi All https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html I have been converting my UDAFs to Dataset Aggregators (Datasets are cool, BTW). I have an ArraySum aggregator that does an element-wise sum of arrays. I have got the simple version working, but
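
A minimal sketch of the element-wise array sum with the Spark 1.6 Aggregator API, assuming all input arrays have the same length; this is the simple version, not the generic one the thread asks about:

    import org.apache.spark.sql.expressions.Aggregator

    object ArraySum extends Aggregator[Seq[Double], Seq[Double], Seq[Double]] {
      def zero: Seq[Double] = Seq.empty
      def reduce(buf: Seq[Double], a: Seq[Double]): Seq[Double] =
        if (buf.isEmpty) a else buf.zip(a).map { case (x, y) => x + y }
      def merge(b1: Seq[Double], b2: Seq[Double]): Seq[Double] = reduce(b1, b2)
      def finish(reduction: Seq[Double]): Seq[Double] = reduction
    }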

RE: a question about web ui log

2016-01-25 Thread Mohammed Guller
I am not sure whether you can copy the log files from Spark workers to your local machine and view them from the Web UI. In fact, if you are able to copy the log files locally, you can just view them directly in any text editor. I suspect what you really want to see is the application history.

Re: Spark Streaming - Custom ReceiverInputDStream ( Custom Source) In java

2016-01-25 Thread Tathagata Das
See how other Java wrapper classes use JavaSparkContext.fakeClassTag, for example: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaMapWithStateDStream.scala On Fri, Jan 22, 2016 at 2:00 AM, Nagu Kothapalli wrote:

Re: Concurrent Spark jobs

2016-01-25 Thread emlyn
Jean wrote > Have you considered using pools? > http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools > > I haven't tried that by myself, but it looks like the pool setting is applied > per thread, so that means it's possible to configure the fair scheduler so > that more than one

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
I haven't gone through many details of the Spark Catalyst optimizer and the Tungsten project, but we have been advised by Databricks support to use DataFrames to resolve the OOM errors that we are getting during Join and GroupBy operations. We use Spark 1.3.1 and it looks like it cannot perform

Re: Spark RDD DAG behaviour understanding in case of checkpointing

2016-01-25 Thread Cody Koeninger
Where are you calling checkpointing? Metadata checkpointing for a Kafka direct stream should just be the offsets, not the data. TD can better speak to reduceByKeyAndWindow behavior when restoring from a checkpoint, but ultimately the only available choices would be to replay the prior window data

Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Please see also: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html According to Chris Nauroth, an hdfs committer, it's extremely difficult to use the feature correctly. The feature also brings operational complexity. Since off-heap memory is

how to build spark with out hive

2016-01-25 Thread kevin
Hi all, I need to test Hive on Spark, using Spark as Hive's execution engine. I downloaded the Spark 1.5.2 source from the Apache website. I have installed Maven 3.3.9 and Scala 2.10.6, so I changed ./make-distribution.sh to point to the mvn location where I installed it. Then I ran

SparkSQL : "select non null values from column"

2016-01-25 Thread Eli Super
Hi, I am trying to select all values except the NULL values from a column that contains NULLs, with sqlContext.sql("select my_column from my_table where my_column <> null ").show(15) or sqlContext.sql("select my_column from my_table where my_column != null ").show(15), but I get an empty result. Thanks!

RangePartitioning skewed data

2016-01-25 Thread jluan
Let's say I have a dataset of (K,V) where the keys are really skewed: myDataRDD = [(8, 1), (8, 13), (1,1), (2,4)] [(8, 12), (8, 15), (8, 7), (8, 6), (8, 4), (8, 3), (8, 4), (10,2)] If I applied a RangePartitioner to this set of data, say val rangePart = new RangePartitioner(4, myDataRDD) and
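
A minimal sketch for observing the resulting distribution, assuming Spark 1.x; RangePartitioner samples the keys, and a single dominant key like 8 cannot be split across range boundaries:

    import org.apache.spark.RangePartitioner

    val myDataRDD = sc.parallelize(Seq(
      (8, 1), (8, 13), (1, 1), (2, 4), (8, 12), (8, 15),
      (8, 7), (8, 6), (8, 4), (8, 3), (8, 4), (10, 2)))
    val partitioned = myDataRDD.partitionBy(new RangePartitioner(4, myDataRDD))
    // Print (partition index, element count) pairs to see the skew.
    partitioned.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
      .collect().foreach(println)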

How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Eli Super
Hi, What is the best way to discretize a continuous variable with Spark DataFrames? I want to discretize some variable 1) by equal frequency 2) by k-means. I usually use R for this purpose: http://www.inside-r.org/packages/cran/arules/docs/discretize R code, for example: ### equal frequency

Re: SparkSQL : "select non null values from column"

2016-01-25 Thread Deng Ching-Mallete
Hi, Have you tried using IS NOT NULL for the where condition? Thanks, Deng On Mon, Jan 25, 2016 at 7:00 PM, Eli Super wrote: > Hi > > I try to select all values but not NULL values from column contains NULL > values > > with > > sqlContext.sql("select my_column from
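
A minimal sketch of both forms of the fix, assuming `df` is the corresponding DataFrame; in SQL, `my_column <> null` evaluates to NULL for every row, which is why the original queries return nothing:

    sqlContext.sql("SELECT my_column FROM my_table WHERE my_column IS NOT NULL").show(15)
    // DataFrame equivalent:
    df.filter(df("my_column").isNotNull).show(15)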

Re: NA value handling in sparkR

2016-01-25 Thread Devesh Raj Singh
Hi, yes, you are right. I think the problem is with the reading of CSV files; read.df is not considering NAs in the CSV file. So what would be a workable solution for dealing with NAs in CSV files? On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel wrote: > Hi Devesh, > >

SparkSQL return all null fields when FIELDS TERMINATED BY '\t' and have a partition.

2016-01-25 Thread Liu Yiding
Hi all, I am using CDH 5.5 (Spark 1.5 and Hive 1.1) and I ran into a strange problem. In hive: hive> create table `tmp.test_d`(`id` int, `name` string) PARTITIONED BY (`dt` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; hive> load data local inpath

Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-25 Thread Yash Sharma
For specific offsets you can directly pass the offset ranges and use KafkaUtils.createRDD to get the events that were missed in the DStream. - Thanks, via mobile, excuse brevity. On Jan 25, 2016 3:33 PM, "Raju Bairishetti" wrote: > Hi Yash, >Basically, my question is
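
A minimal sketch of the batch-read approach described, assuming the Spark 1.x Kafka API, a SparkContext `sc`, and the usual kafkaParams map; topic name and offsets are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Read exactly the missed offsets as a one-off RDD of (key, message) pairs.
    val ranges = Array(OffsetRange("my-topic", 0, fromOffset = 100L, untilOffset = 200L))
    val missed = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)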

Undefined job output-path error in Spark on hive

2016-01-25 Thread Akhilesh Pathodia
Hi, I am getting following exception in Spark while writing to hive partitioned table in parquet format: 16/01/25 03:56:40 ERROR executor.Executor: Exception in task 0.2 in stage 1.0 (TID 3) java.io.IOException: Undefined job output-path at

Getting top distinct strings from arraylist

2016-01-25 Thread Patrick Plaatje
Hi, I’m quite new to Spark and MR, but have a requirement to get all distinct values with their respective counts from a transactional file. Let’s assume the following file format: 0 1 2 3 4 5 6 7 1 3 4 5 8 9 9 10 11 12 13 14 15 16 17 18 1 4 7 11 12 13 19 20 3 4 7 11 15 20 21 22 23 1 2 5 9 11
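
A minimal sketch under the assumption that items are whitespace-delimited, as in the sample rows above; it is the standard word-count shape, followed by taking the top N by count:

    val counts = sc.textFile("transactions.txt")
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(item => (item, 1))
      .reduceByKey(_ + _)
    val top10 = counts.sortBy(_._2, ascending = false).take(10)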

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Darren Govoni
Probably we should open a ticket for this. There's definitely a deadlock situation occurring in Spark under certain conditions. The only clue I have is that it always happens on the last stage. And it does seem sensitive to scale: if my job has 300 MB of data I'll see the deadlock, but if I only

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-25 Thread awzurn
Sorry for the bump, but I'm wondering if anyone else has seen this before. We're hoping to either resolve this soon, or move on with further steps to turn this into an issue. Thanks in advance, Andrew Zurn