How to add a typesafe config file which is located on HDFS to spark-submit (cluster-mode)?

2016-02-22 Thread Johannes Haaß
Hi, I have a Spark (Spark 1.5.2) application that streams data from Kafka to HDFS. My application contains two Typesafe config files to configure certain things like Kafka topic etc. Now I want to run my application with spark-submit (cluster mode) in a cluster. The jar file with all dependencies
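
For illustration only (not from the thread): one way to read a Typesafe config that lives on HDFS is to open it with the Hadoop FileSystem API and parse it with ConfigFactory; the path, helper name and config key below are hypothetical.

  import java.io.InputStreamReader
  import com.typesafe.config.{Config, ConfigFactory}
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path

  // Sketch: load a Typesafe config stored on HDFS instead of on the local classpath.
  def loadConfigFromHdfs(hadoopConf: Configuration, path: String): Config = {
    val p = new Path(path)
    val fs = p.getFileSystem(hadoopConf)      // resolves hdfs:// URIs
    val in = fs.open(p)
    try ConfigFactory.parseReader(new InputStreamReader(in, "UTF-8"))
    finally in.close()
  }

  // val appConf = loadConfigFromHdfs(sc.hadoopConfiguration, "hdfs:///configs/app.conf")
  // val topic   = appConf.getString("kafka.topic")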

Re: Using functional programming rather than SQL

2016-02-22 Thread Michał Zieliński
Your SQL query will look something like that in DataFrames (but as Ted said, check the docs to see the signatures). smallsales .join(times,"time_id") .join(channels,"channel_id") .groupBy("calendar_month_desc","channel_desc") .agg(sum(col("amount_sold")).as("TotalSales"),
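
For illustration, a fuller sketch of that translation, assuming smallsales, times and channels are already DataFrames whose column names match the SQL quoted later in the thread:

  import org.apache.spark.sql.functions.{col, sum}

  // DataFrame equivalent of the SQL join/aggregation discussed above (a sketch).
  val totalSales = smallsales
    .join(times, "time_id")
    .join(channels, "channel_id")
    .groupBy("calendar_month_desc", "channel_desc")
    .agg(sum(col("amount_sold")).as("TotalSales"))
    .orderBy("calendar_month_desc", "channel_desc")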

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread @Sanjiv Singh
Hi Varadharajan, That is the point: Spark SQL is able to recognize delta files. See the directory structure below, ONE BASE (43 records) and one DELTA (created after the last insert). And I am able to see the last insert through Spark SQL. *See the complete scenario below:* *Steps:* - Inserted 43 records

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread Varadharajan Mukundan
Hi Sanjiv, Yes. If we make use of Hive JDBC we should be able to retrieve all the rows, since it is Hive which processes the query. But I think the problem with Hive JDBC is that there are two layers of processing, Hive and then Spark with the result set. And another one is performance is

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread @Sanjiv Singh
Hi Varadharajan, Can you elaborate on this (quoted from your previous mail): "I observed that hive transaction storage structure do not work with spark yet"? If it is related to delta files created after each transaction that Spark would not be able to recognize, then I have a table *mytable* (ORC ,

Re: spark 1.6 Not able to start spark

2016-02-22 Thread fightf...@163.com
I think this may be some permission issue. Check your Spark conf for Hadoop-related settings. fightf...@163.com From: Arunkumar Pillai Date: 2016-02-23 14:08 To: user Subject: spark 1.6 Not able to start spark Hi When I try to start spark-shell I'm getting the following error Exception in thread

Re: spark 1.6 Not able to start spark

2016-02-22 Thread Ted Yu
Which Hadoop release did you build Spark against ? Can you give the full stack trace ? > On Feb 22, 2016, at 9:38 PM, Arunkumar Pillai wrote: > > Hi When i try to start spark-shell > I'm getting following error > > > Exception in thread "main"

RE: Sample project on Image Processing

2016-02-22 Thread Mishra, Abhishek
Thank you everyone. I am to work on a PoC with two types of images, which will basically be two PoCs: face recognition and map data processing. I am looking into these links and hopefully will get an idea. Thanks again. Will post the queries as and when I get doubts. Sincerely, Abhishek From:

spark 1.6 Not able to start spark

2016-02-22 Thread Arunkumar Pillai
Hi, When I try to start spark-shell I'm getting the following error: Exception in thread "main" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) at

Re: Spark UI documentation needed

2016-02-22 Thread nsalian
Hi Ajay, Feel free to open a JIRA with the fields that you think are missing and what kind of documentation you wish to see. It would be best to have it in a JIRA to actually track and triage your suggestions. Thank you. - Neelesh S. Salian Cloudera -- View this message in context:

[Example] : Save dataframes with different schema + Spark 1.5.2 and Dataframe + Spark-CSV package

2016-02-22 Thread Divya Gehlot
Hi, My use case: I have two datasets like below. Dataset1: year make model comment blank Carname 2012 Tesla S No comment 1997 Ford E350 Go get one now they are going fast MyFord 2015 Chevy Volt 2016 Mercedes Dataset2: carowner year make model John 2012 Tesla S David Peter 1997 Ford E350 Paul 2015 Chevy Volt

Spark UI documentation needed

2016-02-22 Thread Ajay Gupta
Hi Sparklers, Can you guys give an elaborate documentation of Spark UI as there are many fields in it and we do not know much about it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-UI-documentaton-needed-tp26300.html Sent from the Apache Spark

Checkpointing with Kafka streaming

2016-02-22 Thread p pathiyil
Hi, While using Kafka streaming with the direct API, does the checkpoint consist of more than the Kafka offset if no 'window' operations are being used ? Is there any facility to check the contents of the checkpoint files ? Thanks.

RE: Can we load csv partitioned data into one DF?

2016-02-22 Thread Mohammed Guller
Are all the csv files in the same directory? Mohammed Author: Big Data Analytics with Spark From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Monday, February 22, 2016 7:25 AM To:

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Jay Luan
Thank you, that helps a lot. On Mon, Feb 22, 2016 at 6:01 PM, Takeshi Yamamuro wrote: > You're correct, reduceByKey is just an example. > > On Tue, Feb 23, 2016 at 10:57 AM, Jay Luan wrote: > >> Could you elaborate on how this would work? >> >> So

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Takeshi Yamamuro
You're correct, reduceByKey is just an example. On Tue, Feb 23, 2016 at 10:57 AM, Jay Luan wrote: > Could you elaborate on how this would work? > > So from what I can tell, this maps a key to a tuple which always has a 0 > as the second element. From there the hash widely

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Jay Luan
Could you elaborate on how this would work? So from what I can tell, this maps a key to a tuple which always has a 0 as the second element. From there the hash widely changes because we now hash something like ((1,4), 0) and ((1,3), 0). Thus mapping this would create more even partitions. Why

Re: SparkMaster IP

2016-02-22 Thread Arko Provo Mukherjee
Passing --host localhost solved the issue, thanks! Warm regards Arko On Mon, Feb 22, 2016 at 5:44 PM, Jakob Odersky wrote: > Spark master by default binds to whatever ip address your current host > resolves to. You have a few options to change that: > - override the ip by

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Takeshi Yamamuro
Hi, How about adding dummy values? values.map(d => (d, 0)).reduceByKey(_ + _) On Tue, Feb 23, 2016 at 10:15 AM, jluan wrote: > I was wondering, is there a way to force something like the hash > partitioner > to use the entire entry of a PairRDD as a hash rather than just
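
A minimal sketch of that idea, assuming a spark-shell style sc: pairing each element with a dummy value makes the whole tuple the key, so the HashPartitioner hashes the entire entry rather than just the first component.

  import org.apache.spark.HashPartitioner

  val values = sc.parallelize(Seq((1, 4), (1, 3), (2, 3), (2, 5), (2, 10)))
  // Make the whole pair the key; HashPartitioner now hashes (1,4), (1,3), ... individually.
  val byWholePair = values.map(d => (d, 0)).partitionBy(new HashPartitioner(4))
  val counts = byWholePair.reduceByKey(_ + _)   // example downstream aggregation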

Re: SparkMaster IP

2016-02-22 Thread Jakob Odersky
Spark master by default binds to whatever ip address your current host resolves to. You have a few options to change that: - override the ip by setting the environment variable SPARK_LOCAL_IP - change the ip in your local "hosts" file (/etc/hosts on linux, not sure on windows) - specify a

Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread jluan
I was wondering, is there a way to force something like the hash partitioner to use the entire entry of a PairRDD as a hash rather than just the key? For Example, if we have an RDD with values: PairRDD = [(1,4), (1, 3), (2, 3), (2,5), (2, 10)]. Rather than using keys 1 and 2, can we force the

SparkMaster IP

2016-02-22 Thread Arko Provo Mukherjee
Hello, I am running Spark on Windows. I start up master as follows: .\spark-class.cmd org.apache.spark.deploy.master.Master I see that the SparkMaster doesn't start on 127.0.0.1 but starts on my "actual" IP. This is troublesome for me as I use it in my code and need to change every time I

Re: Using functional programming rather than SQL

2016-02-22 Thread Ted Yu
Mich: Please refer to the following test suite for examples on various DataFrame operations: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala On Mon, Feb 22, 2016 at 4:39 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Thanks Dean. > > I gather if

Re: Option[Long] parameter in case class parsed from JSON DataFrame failing when key not present in JSON

2016-02-22 Thread Jakob Odersky
I think the issue is that the `read.json` function has no idea of the underlying schema, in fact the documentation (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader) says: > Unless the schema is specified using schema function, this function goes >
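
As a sketch (field names, types and path are hypothetical, assuming an existing sqlContext), supplying the schema up front avoids that extra pass over the data and keeps optional keys as nullable columns:

  import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

  // Give json() an explicit schema so it skips the schema-inference scan.
  val schema = StructType(Seq(
    StructField("customer_id", StringType, nullable = false),
    StructField("product_id", LongType, nullable = true)   // nullable keeps rows missing the key
  ))
  val df = sqlContext.read.schema(schema).json("/path/to/data.json")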

Re: Using functional programming rather than SQL

2016-02-22 Thread Mich Talebzadeh
Thanks Dean. I gather that if I wanted to do the whole thing through FP with little or no use of SQL, then in the first instance, as I get the data set from Hive (i.e., val rs = HiveContext.sql("""SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales FROM smallsales s,

Re: Using functional programming rather than SQL

2016-02-22 Thread Koert Kuipers
However, to really enjoy functional programming I assume you also want to use lambdas in your map and filter, which means you need to convert the DataFrame to a Dataset using df.as[SomeCaseClass]. Just be aware that it's somewhat early days for Dataset. On Mon, Feb 22, 2016 at 6:45 PM, Kevin Mellott
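
A small sketch of that conversion, assuming Spark 1.6, an existing sqlContext, and a DataFrame df whose columns match the hypothetical case class:

  case class Sale(calendar_month_desc: String, channel_desc: String, TotalSales: Double)

  import sqlContext.implicits._
  // Convert the DataFrame to a typed Dataset so lambdas receive case class instances.
  val ds = df.as[Sale]
  val bigMonths = ds.filter(s => s.TotalSales > 10000.0)
                    .map(s => (s.calendar_month_desc, s.TotalSales))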

Re: Using functional programming rather than SQL

2016-02-22 Thread Dean Wampler
Kevin gave you the answer you need, but I'd like to comment on your subject line. SQL is a limited form of FP. Sure, there are no anonymous functions and other limitations, but it's declarative, like good FP programs should be, and it offers an important subset of the operators ("combinators") you

Re: Using functional programming rather than SQL

2016-02-22 Thread Kevin Mellott
In your example, the *rs* instance should be a DataFrame object. In other words, the result of *HiveContext.sql* is a DataFrame that you can manipulate using *filter, map, *etc. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext On Mon, Feb 22, 2016

Variable performance in Spark threads

2016-02-22 Thread alberto.scolari
Hi everybody, I am running a Spark job with multiple Map-Reduce iterations on a cluster with multi-core machines. Inside each machine, I observe variable performance, with some threads taking 20% more time than others (within the same machine). I checked that the input size is the same for all the

Re: Serializing collections in Datasets

2016-02-22 Thread Koert Kuipers
it works in 2.0.0-SNAPSHOT On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust wrote: > I think this will be fixed in 1.6.1. Can you test when we post the first > RC? (hopefully later today) > > On Mon, Feb 22, 2016 at 1:51 PM, Daniel Siegmann < >

Re: Serializing collections in Datasets

2016-02-22 Thread Michael Armbrust
I think this will be fixed in 1.6.1. Can you test when we post the first RC? (hopefully later today) On Mon, Feb 22, 2016 at 1:51 PM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Experimenting with datasets in Spark 1.6.0 I ran into a serialization > error when using case classes

Using functional programming rather than SQL

2016-02-22 Thread Mich Talebzadeh
Hi, I have data stored in Hive tables on which I want to do some simple manipulation. Currently in Spark I get the result set using SQL against the Hive tables and register it as a temporary table in Spark. Now ideally I can get the result set into a DF and work on the DF to slice

Re: Streaming mapWithState API has NullPointerException

2016-02-22 Thread Aris
If I build from git branch origin/branch-1.6 will I be OK to test out my code? Thank you so much TD! Aris On Mon, Feb 22, 2016 at 2:48 PM, Tathagata Das wrote: > There were a few bugs that were solved with mapWithState recently. Would > be available in 1.6.1 (RC

Re: Streaming mapWithState API has NullPointerException

2016-02-22 Thread Tathagata Das
There were a few bugs that were solved with mapWithState recently. Would be available in 1.6.1 (RC to be cut soon). On Mon, Feb 22, 2016 at 5:29 PM, Aris wrote: > Hello Spark community, and especially TD and Spark Streaming folks: > > I am using the new Spark 1.6.0

Streaming mapWithState API has NullPointerException

2016-02-22 Thread Aris
Hello Spark community, and especially TD and Spark Streaming folks: I am using the new Spark 1.6.0 Streaming mapWithState API, in order to accomplish a streaming joining task with data. Things work fine on smaller sets of data, but on a single-node large cluster with JSON strings amounting to

DirectFileOutputCommitter

2016-02-22 Thread igor.berman
Hi, I wanted to understand whether anybody uses DirectFileOutputCommitter or the like, especially when working with S3? I know that there is one implementation in the Spark distro for the Parquet format, but not for plain files - why? IMHO, it can bring a huge performance boost. Using the default FileOutputCommitter with S3 has big

Serializing collections in Datasets

2016-02-22 Thread Daniel Siegmann
Experimenting with datasets in Spark 1.6.0 I ran into a serialization error when using case classes containing a Seq member. There is no problem when using Array instead. Nor is there a problem using RDD or DataFrame (even if converting the DF to a DS later). Here's an example you can test in the

Re: Newbie questions regarding log processing

2016-02-22 Thread Philippe de Rochambeau
Thank you to you both, Jorge and Mich. You've answered my questions in a quasi-realtime manner! I will look into Flume and HDFS. > Le 22 févr. 2016 à 22:41, Jorge Machado a écrit : > > To Get the that you could use Flume to ship the logs from the Servers to the > HDFS for

Re: Newbie questions regarding log processing

2016-02-22 Thread Teng Qiu
Wow, great post, very detailed. The question is what kind of "web logs" they have. If those logs are application logs, like Apache httpd logs or Oracle logs, then, sure, this is a typical use case for Spark or, more generally, for the Hadoop tech stack. But if Philippe is talking about network

Re: Newbie questions regarding log processing

2016-02-22 Thread Jorge Machado
To Get the that you could use Flume to ship the logs from the Servers to the HDFS for example and to streaming on it. Check this : http://spark.apache.org/docs/latest/streaming-flume-integration.html and

Re: Newbie questions regarding log processing

2016-02-22 Thread Mich Talebzadeh
Hi, There are a number of options here. Your first port of call would be to store these logs that come in from the source in an HDFS directory as time series entries. I assume the logs will be in textual format and will be compressed (gzip, bzip2 etc.). They can be stored individually and you

Re: Spark Streaming not reading input coming from the other ip

2016-02-22 Thread Vinti Maheshwari
Thanks Shixiong, I am not getting any error and telnet is also working fine. $ telnet Trying 192.168.186.97... Connected to ttsv- On Mon, Feb 22, 2016 at 1:10 PM, Shixiong(Ryan) Zhu wrote: > What's the error info reported by Streaming? And could you use

Newbie questions regarding log processing

2016-02-22 Thread Philippe de Rochambeau
Hello, I have a few newbie questions regarding Spark. Is Spark a good tool to process Web logs for attacks (or is it better to used a more specialized tool)? If so, are there any plugins for this purpose? Can you use Spark to weed out huge logs and extract only suspicious activities; e.g., 1000

Re: Spark Streaming not reading input coming from the other ip

2016-02-22 Thread Shixiong(Ryan) Zhu
What's the error info reported by Streaming? And could you use "telnet" to test if the network is normal? On Mon, Feb 22, 2016 at 6:59 AM, Vinti Maheshwari wrote: > For reference, my program: > > def main(args: Array[String]): Unit = { > val conf = new

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
Hey Abhi, Using reduceByKeyAndWindow and mapWithState together will trigger the bug in SPARK-6847. Here is a workaround to trigger checkpoint manually: JavaMapWithStateDStream<...> stateDStream = myPairDstream.mapWithState(StateSpec.function(mappingFunc)); stateDStream.foreachRDD(new

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
Hey, Ted, As the fix for SPARK-6847 changes the semantics of Streaming checkpointing, it doesn't go into branch 1.6. A workaround is calling `count` to trigger the checkpoint manually. Such as, val dstream = ... // dstream is an operator needing to be checkpointed. dstream.foreachRDD(rdd =>
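
Filled out, that workaround might look like the sketch below; pairDstream and mappingFunc are placeholders for the application's own stateful stream and mapping function.

  import org.apache.spark.streaming.StateSpec

  val stateDstream = pairDstream.mapWithState(StateSpec.function(mappingFunc))
  // count() materializes the stream every batch, so the periodic checkpoint is actually taken.
  stateDstream.foreachRDD(rdd => rdd.count())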

Re: Left/Right Outer join on multiple Columns

2016-02-22 Thread Abhisheks
did you try this - DataFrame joinedDf_intersect = leftDf.select("x", "y", "z") .join(rightDf,leftDf.col("x").equalTo(rightDf.col("x")) .and(leftDf.col("y").equalTo(rightDf.col("y"))), "left_outer") ; Hope that helps. On Mon, Feb 22, 2016 at 12:22 PM, praneshvyas [via Apache Spark User List] <
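
The same multi-column left outer join written in Scala might look like this sketch (DataFrame and column names are assumed from the snippet above):

  // Left outer join on two columns using a combined join expression.
  val joined = leftDf.join(
    rightDf,
    leftDf("x") === rightDf("x") && leftDf("y") === rightDf("y"),
    "left_outer")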

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Ted Yu
Fix for SPARK-6847 is not in branch-1.6 Should the fix be ported to branch-1.6 ? Thanks > On Feb 22, 2016, at 11:55 AM, Shixiong(Ryan) Zhu > wrote: > > Hey Abhi, > > Could you post how you use mapWithState? By default, it should do > checkpointing every 10 batches.

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Abhishek Anand
Hi Ryan, Reposting the code. Basically my use case is something like this - I am receiving web impression logs and may get the notify (listening from Kafka) for those impressions in the same interval (for me it's 1 min) or any later interval (up to 2 hours). Now, when I receive a notify for a

Re: Spark Job Hanging on Join

2016-02-22 Thread Dave Moyers
Good article! Thanks for sharing! > On Feb 22, 2016, at 11:10 AM, Davies Liu wrote: > > This link may help: > https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html > > Spark 1.6 had improved the CatesianProduct, you should

Re: Option[Long] parameter in case class parsed from JSON DataFrame failing when key not present in JSON

2016-02-22 Thread Jorge Machado
Hi Anthony, I tried the code myself. I think the problem is in the jsonStr. I did it with: val jsonStr = """{"customer_id": "3ee066ab571e03dd5f3c443a6c34417a","product_id": 3}""" Or is it the "," after your 3, or the "\n"? Regards > On 22/02/2016, at 15:42, Anthony Brew

Re: Read from kafka after application is restarted

2016-02-22 Thread Chitturi Padma
Hi Vaibhav, Please try with Kafka direct API approach. Is this not working ? -- Padma Ch On Tue, Feb 23, 2016 at 12:36 AM, vaibhavrtk1 [via Apache Spark User List] < ml-node+s1001560n26291...@n3.nabble.com> wrote: > Hi > > I am using kafka with spark streaming 1.3.0 . When the spark

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
Hey Abhi, Could you post how you use mapWithState? By default, it should do checkpointing every 10 batches. However, there is a known issue that prevents mapWithState from checkpointing in some special cases: https://issues.apache.org/jira/browse/SPARK-6847 On Mon, Feb 22, 2016 at 5:47 AM,

Re: Read from kafka after application is restarted

2016-02-22 Thread Cody Koeninger
The direct stream will let you do both of those things. Is there a reason you want to use receivers? http://spark.apache.org/docs/latest/streaming-kafka-integration.html http://spark.apache.org/docs/latest/configuration.html#spark-streaming look for maxRatePerPartition On Mon, Feb 22, 2016
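
For illustration, a sketch of the direct-stream setup being suggested; the broker list, topic and rate below are placeholders.

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  // Direct Kafka stream: offsets tracked by Spark, starting from the smallest offset
  // on first run, with a cap on the per-partition ingest rate.
  val conf = new SparkConf()
    .setAppName("DirectKafkaExample")
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")   // placeholder rate
  val ssc = new StreamingContext(conf, Seconds(10))

  val kafkaParams = Map(
    "metadata.broker.list" -> "broker1:9092",   // placeholder brokers
    "auto.offset.reset"    -> "smallest")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("mytopic"))           // placeholder topic
  stream.map(_._2).print()

  ssc.start()
  ssc.awaitTermination()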

Read from kafka after application is restarted

2016-02-22 Thread vaibhavrtk1
Hi, I am using Kafka with Spark Streaming 1.3.0. When the Spark application is not running, Kafka is still receiving messages. When I start the application, those messages which were received while Spark was not running are not processed. I am using an unreliable receiver-based approach.

Re: map operation clears custom partitioner

2016-02-22 Thread Silvio Fiorito
You can use mapValues to ensure partitioning is not lost. From: Brian London > Date: Monday, February 22, 2016 at 1:21 PM To: user > Subject: map operation clears custom partitioner It

Re: map operation clears custom partitioner

2016-02-22 Thread Sean Owen
The problem is that your new mapped values may be in the wrong partition, according to your partitioner. Look for methods that have a preservesPartitioning flag, which is a way to indicate that you know the partitioning remains correct. (Like, you partition by keys and didn't change the keys in
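
A quick sketch of the difference, assuming a spark-shell style sc:

  import org.apache.spark.HashPartitioner

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    .partitionBy(new HashPartitioner(4))

  // map() may change keys, so the custom partitioner is dropped;
  // mapValues() cannot change keys, so it is preserved.
  val mapped    = pairs.map { case (k, v) => (k, v * 2) }
  val mapValued = pairs.mapValues(_ * 2)

  println(mapped.partitioner)     // None
  println(mapValued.partitioner)  // Some(org.apache.spark.HashPartitioner@...)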

map operation clears custom partitioner

2016-02-22 Thread Brian London
It appears that when a custom partitioner is applied in a groupBy operation, it is not propagated through subsequent non-shuffle operations. Is this intentional? Is there any way to carry custom partitioning through maps? I've uploaded a gist that exhibits the behavior.

Re: Does Spark satisfy my requirements?

2016-02-22 Thread Chitturi Padma
Hi, When you say that you want to produce new information, are you looking forward to put the processed data in other consumers ? Spark will be definitely the choice for real-time streaming computations. Are you looking for near-real time processing or exactly real-time processing ? On Sun, Feb

Re: Spark Job Hanging on Join

2016-02-22 Thread Davies Liu
This link may help: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html Spark 1.6 improved the CartesianProduct; you should turn off auto broadcast and go with CartesianProduct in 1.6. On Mon, Feb 22, 2016 at 1:45 AM, Mohannad Ali

DataFrame and char encoding

2016-02-22 Thread jdkorigan
Hi, I'm trying to find a solution to display strings with accents or special chars (latin-1). Is there a way to create a DataFrame with a special char encoding? fields = [StructField("subject", StringType(), True)] schema = StructType(fields) DF = sqlContext.createDataFrame(RDD,

Re: Can we load csv partitioned data into one DF?

2016-02-22 Thread Mich Talebzadeh
Indeed this will work. Additionally the files could be zipped as well (gz or bzip2) val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg") On 22/02/2016 15:32, Alex Dzhagriev wrote: > Hi Saif, > > You can put

Re: Spark Cache Eviction

2016-02-22 Thread Ted Yu
Please see SPARK-1762 Add functionality to pin RDDs in cache On Mon, Feb 22, 2016 at 6:43 AM, Pietro Gentile < pietro.gentile89.develo...@gmail.com> wrote: > Hi all, > > Is there a way to prevent eviction of the RDD from SparkContext ? > I would not use the cache with its default behavior

Can we load csv partitioned data into one DF?

2016-02-22 Thread Saif.A.Ellafi
Hello all, I am facing a silly data question. If I have 100+ csv files which are part of the same data, but each csv is, for example, one year of a timeframe column (i.e. partitioned by year), what would you suggest instead of loading all those files and joining them? Final target would be

Re: Can we load csv partitioned data into one DF?

2016-02-22 Thread Alex Dzhagriev
Hi Saif, You can put your files into one directory and read it as text. Another option is to read them separately and then union the datasets. Thanks, Alex. On Mon, Feb 22, 2016 at 4:25 PM, wrote: > Hello all, I am facing a silly data question. > > If I have +100
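
A sketch of the union route with spark-csv, assuming an existing sqlContext, per-year files that share a schema, and placeholder paths:

  // Read each yearly CSV separately and union them into one DataFrame.
  val years = 2010 to 2015                         // placeholder range
  val perYear = years.map { y =>
    sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(s"/data/sales/$y.csv")                 // placeholder path layout
  }
  val all = perYear.reduce(_ unionAll _)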

Re: an error when I read data from parquet

2016-02-22 Thread Jorge Sánchez
Hi Alex, it seems there is a problem with Spark Notebook, I suggest you follow the issue there (Or you could try Apache Zeppelin or Spark-Shell directly if notebooks are not a requirement): https://github.com/andypetrella/spark-notebook/issues/380 Regards. 2016-02-19 12:59 GMT+00:00

Re: SparkListener onApplicationEnd processing an RDD throws exception because of stopped SparkContext

2016-02-22 Thread Sumona Routh
Ok, I understand. Yes, I will have to handle them in the main thread. Thanks! Sumona On Wed, Feb 17, 2016 at 12:24 PM Shixiong(Ryan) Zhu wrote: > `onApplicationEnd` is posted when SparkContext is stopping, and you cannot > submit any job to a stopping SparkContext.

Re: 回复: a new FileFormat 5x~100x faster than parquet

2016-02-22 Thread Ted Yu
The referenced benchmark is in Chinese. Please provide English version so that more people can understand. For item 7, looks like the speed of ingest is much slower compared to using Parquet. Cheers On Mon, Feb 22, 2016 at 6:12 AM, 开心延年 wrote: > 1.ya100 is not only the

Re: Spark Streaming not reading input coming from the other ip

2016-02-22 Thread Vinti Maheshwari
For reference, my program: def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("HBaseStream") val sc = new SparkContext(conf) // create a StreamingContext, the main entry point for all streaming functionality val ssc = new StreamingContext(sc, Seconds(2))

Option[Long] parameter in case class parsed from JSON DataFrame failing when key not present in JSON

2016-02-22 Thread Anthony Brew
Hi, I'm trying to parse JSON data into a case class using the DataFrame.as[] function, but I am hitting an unusual error and the interweb isn't solving my pain, so I thought I would reach out for help. I've truncated my code a little here to make it readable, but the error is in full. My case class

Spark Cache Eviction

2016-02-22 Thread Pietro Gentile
Hi all, Is there a way to prevent eviction of an RDD from the SparkContext? I would not use the cache with its default behavior (LRU). I would rather manually unpersist RDDs cached in memory/disk. Thanks in advance, Pietro.

Spark Streaming not reading input coming from the other ip

2016-02-22 Thread Vinti Maheshwari
Hi, I am in a Spark Streaming context, and I am reading input from the socket using nc -lk . When I am running it and manually giving input, it's working. But if input is coming from a different IP to this socket, then Spark is not reading that input, though it's showing all the input coming

java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-22 Thread Abhishek Anand
Hi , I am getting the following exception on running my spark streaming job. The same job has been running fine since long and when I added two new machines to my cluster I see the job failing with the following exception. 16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage

an OOM while persist as DISK_ONLY

2016-02-22 Thread Alex Dzhagriev
Hello all, I'm using Spark 1.6 and trying to cache a dataset which is 1.5 TB; I have only ~800 GB RAM in total, so I am choosing the DISK_ONLY storage level. Unfortunately, I'm exceeding the memory overhead limit: Container killed by YARN for exceeding memory limits. 27.0 GB of 27 GB

回复: 回复: a new FileFormat 5x~100x faster than parquet

2016-02-22 Thread 开心延年
1. ya100 is not only the inverted index; it also includes TOP-N sort, lazy read, and labels. 2. Our test of ya100 and parquet is at this link: https://github.com/ycloudnet/ya100/blob/master/v1.0.8/ya100%E6%80%A7%E8%83%BD%E6%B5%8B%E8%AF%95%E6%8A%A5%E5%91%8A.docx?raw=true 3. you are

Re: Sample project on Image Processing

2016-02-22 Thread ndjido
Hi folks, KeystoneML has some image processing features: http://keystone-ml.org/examples.html Cheers, Ardo Sent from my iPhone > On 22 Feb 2016, at 14:34, Sainath Palla wrote: > > Here is one simple example of Image classification in Java. > >

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Abhishek Anand
Any Insights on this one ? Thanks !!! Abhi On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand wrote: > I am now trying to use mapWithState in the following way using some > example codes. But, by looking at the DAG it does not seem to checkpoint > the state and when

Re: Accessing Web UI

2016-02-22 Thread Vasanth Bhat
Thanks a lot Robin. A search on Google seems to indicate that I need to control the minThreads and maxThreads for the thread pool through jetty.xml. But I am not able to find jetty.xml in the Spark installation. Thanks Vasanth On Mon, Feb 22, 2016 at 5:43 PM, Robin East

Re: Sample project on Image Processing

2016-02-22 Thread Sainath Palla
Here is one simple example of Image classification in Java. http://blogs.quovantis.com/image-classification-using-apache-spark-with-linear-svm/ Personally, I feel python provides better libraries for image processing. But it mostly depends on what kind of Image processing you are doing. If you

Re: 回复: a new FileFormat 5x~100x faster than parquet

2016-02-22 Thread Gavin Yue
I recommend you provide more information. Using inverted index certainly speed up the query time if hitting the index, but it would take longer to create and insert. Is the source code not available at this moment? Thanks Gavin > On Feb 22, 2016, at 20:27, 开心延年 wrote:

How to add a typesafe config file which is located on HDFS to spark-submit (cluster-mode)?

2016-02-22 Thread Jobs
Hi, I have a Spark (Spark 1.5.2) application that streams data from Kafka to HDFS. My application contains two Typesafe config files to configure certain things like Kafka topic etc. Now I want to run my application with spark-submit (cluster mode) in a cluster. The jar file with all

Re: [Cassandra-Connector] No Such Method Error despite correct versions

2016-02-22 Thread Jan Algermissen
Doh - minutes after my question I saw the same from a couple of days ago Indeed, using C* driver 3.0.0-rc1 seems to solve the issue Jan > On 22 Feb 2016, at 12:13, Jan Algermissen wrote: > > Hi, > > I am using > > Cassandra 2.1.5 > Spark 1.5.2 > Cassandra

Re: Accessing Web UI

2016-02-22 Thread Vasanth Bhat
The port 4040 is *not* used. No process is listening on 4040. As per the logs, 8080 is used for WebUI. The log mentions the below 16/02/19 03:07:32 INFO Utils: Successfully started service 'MasterUI' on port 8080. 16/02/19 03:07:32 INFO MasterWebUI: Started MasterWebUI at

[Cassandra-Connector] No Such Method Error despite correct versions

2016-02-22 Thread Jan Algermissen
Hi, I am using Cassandra 2.1.5 Spark 1.5.2 Cassandra java-drive 3.0.0 Cassandra-Connector 1.5.0-RC1 All with scala 2.11.7 Nevertheless, I get the following error from my Spark job: java.lang.NoSuchMethodError: com.datastax.driver.core.TableMetadata.getIndexes()Ljava/util/List; at

Re: Sample project on Image Processing

2016-02-22 Thread Akhil Das
What type of Image processing are you doing? Here's a simple example with Tensorflow https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html Thanks Best Regards On Mon, Feb 22, 2016 at 1:53 PM, Mishra, Abhishek wrote: > Hello, > > I am

Re: Accessing Web UI

2016-02-22 Thread Kayode Odeyemi
Try http://localhost:4040 On Mon, Feb 22, 2016 at 8:23 AM, Vasanth Bhat wrote: > Thanks Gourav, Eduardo > > I tried http://localhost:8080 and http://OAhtvJ5MCA:8080/ . In both > cases Firefox just hangs. > > I also tried with the lynx text-based browser. I get the

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Paul Leclercq
Thanks for your quick answer. If I set "auto.offset.reset" to "smallest" in the kafkaParams like this: val kafkaParams = Map[String, String]( "metadata.broker.list" -> brokers, "group.id" -> groupId, "auto.offset.reset" -> "smallest" ) and then use: val streams =

Re: [Example] : read custom schema from file

2016-02-22 Thread Akhil Das
If you are talking about a CSV kind of file, then here's an example http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection Thanks Best Regards On Mon, Feb 22, 2016 at 1:10 PM, Divya Gehlot wrote: > Hi, > Can anybody help me
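
A minimal sketch of the reflection-based approach for a delimited text file (case class, path and delimiter are placeholders, assuming a spark-shell style sc and sqlContext):

  // Infer the DataFrame schema from a case class via reflection.
  case class Person(name: String, age: Int)

  import sqlContext.implicits._
  val people = sc.textFile("/path/people.txt")     // lines like "Alice,30"
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
    .toDF()
  people.printSchema()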

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Saisai Shao
You could set the "auto.offset.reset" configuration through the "kafkaParams" parameter, which is provided in some other overloaded APIs of createStream. By default Kafka will pick data from the latest offset unless you explicitly set it; this is the behavior of Kafka, not Spark. Thanks Saisai On Mon, Feb
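
For the receiver approach, the overloaded createStream that accepts kafkaParams might be used as in this sketch (ZooKeeper address, group id and topic are placeholders, ssc is an existing StreamingContext):

  import kafka.serializer.StringDecoder
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.kafka.KafkaUtils

  // Receiver-based stream with an explicit auto.offset.reset, so a brand-new
  // consumer group starts from the smallest available offset.
  val kafkaParams = Map(
    "zookeeper.connect" -> "zk1:2181",    // placeholder
    "group.id"          -> "my-group",    // placeholder
    "auto.offset.reset" -> "smallest")
  val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)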

Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Paul Leclercq
Hi, Do you know why, with the receiver approach and a *consumer group*, a new topic is not read from the beginning but from the latest offset? Code example : val kafkaStream =

Re: spark.driver.maxResultSize doesn't work in conf-file

2016-02-22 Thread Mohannad Ali
In spark-defaults you put the values like "spark.driver.maxResultSize 0" instead of "spark.driver.maxResultSize=0" I think. On Sat, Feb 20, 2016 at 3:40 PM, AlexModestov wrote: > I have a string spark.driver.maxResultSize=0 in the spark-defaults.conf. > But I get
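
Equivalently (a sketch, not from the thread), the value can be set programmatically on the SparkConf, which sidesteps the config-file syntax question; "0" means unlimited.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("MaxResultSizeExample")
    .set("spark.driver.maxResultSize", "0")   // 0 = no limit on collected result size
  val sc = new SparkContext(conf)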

Re: Spark Job Hanging on Join

2016-02-22 Thread Mohannad Ali
Hello everyone, I'm working with Tamara and I wanted to give you guys an update on the issue: 1. Here is the output of .explain(): > Project >

Re: [Example] : read custom schema from file

2016-02-22 Thread Mohannad Ali
Hello Divya, What kind of file? Best Regards, Mohannad On Mon, Feb 22, 2016 at 8:40 AM, Divya Gehlot wrote: > Hi, > Can anybody help me by providing me example how can we read schema of the > data set from the file. > > > > Thanks, > Divya >

Loading file into executor classpath

2016-02-22 Thread Amjad ALSHABANI
Hello everybody, I've implemented a log analyzer program in Spark, which takes the logs from an Apache log file and translates them into a given object. The log format is described by a Grok pattern, so I'm using the Grok library to extract the desired fields. When running the application locally, it succeeded without

Re: Constantly increasing Spark streaming heap memory

2016-02-22 Thread Robin East
Hi What you describe looks like normal behaviour for almost any Java/Scala application - objects are created on the heap until a limit point is reached and then GC clears away memory allocated to objects that are no longer referenced. Is there an issue you are experiencing? > On 21 Feb

Sample project on Image Processing

2016-02-22 Thread Mishra, Abhishek
Hello, I am working on image processing samples. I was wondering if anyone has worked on an image processing project in Spark. Please let me know if any sample project or example is available. Please guide me on this. Sincerely, Abhishek

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread Varadharajan Mukundan
Actually, auto compaction, if enabled, is triggered based on the volume of changes. It doesn't automatically run after every insert. I think it's possible to reduce the thresholds, but that might reduce performance by a big margin. As of now, we do compaction after the batch insert completes. The