Topic Modelling- LDA

2015-09-22 Thread Subshiri S
Hi, I am experimenting with Spark LDA. How do I create a topic model for prediction in Spark? How do I evaluate the topics modelled in Spark? Could you point me to some examples? Regards, Subshiri
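
A minimal Scala sketch of fitting and inspecting an MLlib LDA model; the tiny corpus and the choice of k here are placeholders, not from the thread:

  import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
  import org.apache.spark.mllib.linalg.Vectors

  // corpus: RDD[(docId, termCountVector)], normally built from tokenized documents
  val corpus = sc.parallelize(Seq(
    (0L, Vectors.dense(1.0, 0.0, 3.0)),
    (1L, Vectors.dense(2.0, 1.0, 0.0))))
  val model = new LDA().setK(2).run(corpus).asInstanceOf[DistributedLDAModel]
  model.describeTopics(maxTermsPerTopic = 3)  // top weighted terms per topic, for manual evaluation
  model.logLikelihood                         // one rough quantitative measure of fit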

Re: Yarn Shutting Down Spark Processing

2015-09-22 Thread Tathagata Das
Do your simple Spark batch jobs work in the same YARN setup? Maybe YARN is not able to provide the resources that you are asking for. On Tue, Sep 22, 2015 at 5:49 PM, Bryan Jeffrey wrote: > Hello. > > I have a Spark streaming job running on a cluster managed by Yarn. The > spark streaming job st

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Sean Owen
Might be better for another list, but I suspect it's more that HBase is simply much more integrated with YARN, and because it's run with other services that are as well. On Wed, Sep 23, 2015 at 12:02 AM, Jacek Laskowski wrote: > That sentence caught my attention. Could you explain the reasons fo

Re: How to get RDD from PairRDD in Java

2015-09-22 Thread Andy Huang
Use .values(), which will return an RDD of just the values. On Wed, Sep 23, 2015 at 4:24 PM, Zhang, Jingyu wrote: > Hi All, > > I want to extract the "value" RDD from a PairRDD in Java > > Please let me know how I can get it easily. > > Thanks > > Jingyu > > > This message and its attachments may conta
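
A minimal sketch of the suggestion, in Scala (the Java call JavaPairRDD.values() is the direct analogue):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))  // a pair RDD
  val values = pairs.values                            // RDD[Int] containing just the values
  values.collect()                                     // Array(1, 2)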

How to get RDD from PairRDD in Java

2015-09-22 Thread Zhang, Jingyu
Hi All, I want to extract the "value" RDD from a PairRDD in Java. Please let me know how I can get it easily. Thanks Jingyu -- This message and its attachments may contain legally privileged or confidential information. It is intended solely for the named addressee. If you are not the address

Re: Why RDDs are being dropped by Executors?

2015-09-22 Thread Tathagata Das
If the RDD is not constantly in use, then the LRU scheme in each executor can kick out some of the partitions from memory. If you want to avoid recomputing in such cases, you could persist with StorageLevel.MEMORY_AND_DISK, where the partitions will be dropped to disk when kicked out of memory. That wil
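
A minimal sketch of the persist call described above (the input path is hypothetical):

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.textFile("hdfs:///data/input")
  rdd.persist(StorageLevel.MEMORY_AND_DISK)  // partitions evicted from memory spill to disk
  rdd.count()                                // first action materializes and caches the partitions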

Re: WAL on S3

2015-09-22 Thread Tathagata Das
Responses inline. On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia wrote: > Can checkpoints be stored to S3 (via an S3/S3A Hadoop URL)? > > Yes. Because each checkpoint is a single file by itself and does not require flush semantics to work. So S3 is fine. > Trying to answer this question, I looke

Re: Invalid checkpoint url

2015-09-22 Thread Tathagata Das
Bingo! That is the problem. The solution is now obvious, I presume :) On Tue, Sep 22, 2015 at 9:41 PM, srungarapu vamsi wrote: > @Das, > No, I am getting it in cluster mode. > I think I understood why I am getting this error, please correct me if I > am wrong. > Reason is: > checkpointing writes

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Tathagata Das
SPARK_CLASSPATH is, I believe, deprecated right now, so I am not surprised that there is some difference in the code paths. On Tue, Sep 22, 2015 at 9:45 PM, Saisai Shao wrote: > I think it is something related to the class loader; the behavior is different > for classpath and --jars. If you want to kn

Re: Streaming Receiver Imbalance Problem

2015-09-22 Thread Tathagata Das
Also, you could switch to the Direct Kafka API, which was first released as experimental in 1.3. In 1.5 we graduated it from experimental, but it's quite usable in Spark 1.3.1. TD On Tue, Sep 22, 2015 at 7:45 PM, SLiZn Liu wrote: > Cool, we are still sticking with 1.3.1, will upgrade to 1.5 ASAP.

How to make Group By/reduceByKey more efficient?

2015-09-22 Thread swetha
Hi, How can I make Group By more efficient? Is it recommended to use a custom partitioner and then do a Group By? And can we use a custom partitioner and then use reduceByKey for optimization? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com
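
One common answer, sketched here on the assumption that the goal is per-key aggregation: reduceByKey combines values map-side before the shuffle, unlike groupByKey, and it accepts an explicit partitioner:

  import org.apache.spark.HashPartitioner

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  // values are pre-aggregated on the map side, then shuffled by the given partitioner
  val sums = pairs.reduceByKey(new HashPartitioner(8), _ + _)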

Re: Partitions on RDDs

2015-09-22 Thread Yashwanth Kumar
Hi, In the first RDD transformation (e.g. reading from a file with sc.textFile("path", partitions)), the partitioning you specify will be applied to all further transformations and actions on this RDD. In a few places, repartitioning your RDD will give an added advantage. Repartition is usually done during act

Re: SparkR vs R

2015-09-22 Thread Yashwanth Kumar
Hi, 1. The main difference between SparkR and R is that SparkR can handle big data. Yes, you can use other core libraries inside SparkR (not algos like lm(), glm(), kmeans()). 2. Yes, core R libraries will not be distributed. You can use functions from these libraries which are applicable for mapper ki

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Deenar Toraskar
Daniel, can you elaborate on why you are using a broadcast variable to concatenate many Avro files into a single ORC file? Look at wholeTextFiles on SparkContext. SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, conten
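
A minimal sketch of the wholeTextFiles call mentioned above (path hypothetical; note it reads text files, so binary Avro would still need its own input format):

  // each element is (fileName, fileContent), so thousands of small files become one RDD
  val files = sc.wholeTextFiles("hdfs:///data/small-files")
  files.map { case (name, content) => s"$name: ${content.length} chars" }.take(5)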

JdbcRDD Constructor

2015-09-22 Thread satish chandra j
Hi All, The JdbcRDD constructor has the following parameters: *JdbcRDD *(SparkContext

Re: Creating BlockMatrix with java API

2015-09-22 Thread Pulasthi Supun Wickramasinghe
Hi Sabarish, Thanks, that would indeed solve my problem. Best Regards, Pulasthi On Wed, Sep 23, 2015 at 12:55 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > Hi Pulasthi > > You can always use JavaRDD.rdd() to get the scala rdd. So in your case, > > new BlockMatrix(rdd.rdd(),

Re: Creating BlockMatrix with java API

2015-09-22 Thread Sabarish Sasidharan
Hi Pulasthi, You can always use JavaRDD.rdd() to get the Scala RDD. So in your case, new BlockMatrix(rdd.rdd(), 2, 2) should work. Regards Sab On Tue, Sep 22, 2015 at 10:50 PM, Pulasthi Supun Wickramasinghe < pulasthi...@gmail.com> wrote: > Hi Yanbo, > > Thanks for the reply. I thought i might
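
For reference, a minimal Scala sketch of constructing a BlockMatrix directly (block sizes and values invented); from Java, the same constructor is reached via javaRdd.rdd() as described above:

  import org.apache.spark.mllib.linalg.Matrices
  import org.apache.spark.mllib.linalg.distributed.BlockMatrix

  // blocks keyed by (blockRowIndex, blockColIndex)
  val blocks = sc.parallelize(Seq(
    ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))),
    ((0, 1), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))))
  val mat = new BlockMatrix(blocks, 2, 2)  // rowsPerBlock = 2, colsPerBlock = 2
  mat.validate()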

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Jonathan Coveney
It's entirely plausible to beat Spark in performance on tiny data sets like this; that's not really what it has been optimized for. On Tuesday, September 22, 2015, juljoin wrote: > Hello, > > I am trying to figure Spark out and I still have some problems with its > speed, I ca

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Saisai Shao
I think it is something related to the class loader; the behavior is different for classpath and --jars. If you want to know the details I think you'd better dig out some source code. Thanks Jerry On Tue, Sep 22, 2015 at 9:10 PM, ayan guha wrote: > I must have gone mad :) Thanks for pointing i

Re: Invalid checkpoint url

2015-09-22 Thread srungarapu vamsi
@Das, No, I am getting it in cluster mode. I think I understood why I am getting this error; please correct me if I am wrong. The reason is: checkpointing writes the RDD to disk, so this checkpointing happens on all workers. Whenever Spark has to read back the RDD, the checkpoint directory should be reachab

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-22 Thread Andy Huang
Alternatively, I would suggest you look at programmatically building the schema; refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema Cheers Andy On Wed, Sep 23, 2015 at 2:07 PM, Ted Yu wrote: > Can you switch to 2.11 ? > > The follow
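
A minimal sketch of the programmatic-schema approach from that guide, sized for the 37-column case in this thread (path and column names are placeholders):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // build the schema from a list of names -- no case class, so no 22-field limit
  val schema = StructType((1 to 37).map(i => StructField(s"col$i", StringType, nullable = true)))
  val rowRDD = sc.textFile("hdfs:///data/input.csv")
    .map(_.split(",")).map(fields => Row(fields: _*))
  val df = sqlContext.createDataFrame(rowRDD, schema)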

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
I must have gone mad :) Thanks for pointing it out. I downloaded the 1.5.0 assembly jar and added it to SPARK_CLASSPATH. However, I am getting a new error now >>> kvs = KafkaUtils.createDirectStream(ssc,['spark'],{"metadata.broker.list":'localhost:9092'}) __

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-22 Thread Ted Yu
Can you switch to 2.11? The following has been fixed in 2.11: https://issues.scala-lang.org/browse/SI-7296 Otherwise, consider packaging related values into a case class of their own. On Tue, Sep 22, 2015 at 8:48 PM, satish chandra j wrote: > HI All, > Do we have any alternative solutions in S
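
A minimal sketch of the nesting workaround Ted suggests (field names invented); note that the resulting Spark SQL schema becomes nested structs rather than 37 flat columns:

  // group related fields so no single case class exceeds 22 parameters
  case class Name(first: String, last: String)
  case class Address(street: String, city: String, country: String)
  case class Person(id: Long, name: Name, address: Address)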

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhiliang Zhu
Hi Zhan, I really appreciate your help; I will do that next. And on the local machine, no Hadoop/Spark needs to be installed, only the copied /etc/hadoop/conf... whether the information (for example IP, hostname etc.) of the local machine would be set in the conf files... Moreover, do yo

Scala Limitation - Case Class definition with more than 22 arguments

2015-09-22 Thread satish chandra j
Hi All, Do we have any alternative solutions in Scala to avoid the limitation on defining a case class with more than 22 arguments? We are using Scala version 2.10.2; currently I need to define a case class with 37 arguments but am getting an error: "*error: Implementation restriction: case classes ca

RE: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Chirag Dewan
Thanks Ted and Rich. So if I repartition my RDD programmatically and call coalesce on the RDD to 1 partition, would that generate 1 output file? Ahh.. is my coalesce operation causing 1 partition, hence 1 output file and 1 executor working on all the data? To summarize, this is what I do: 1)

Re: SparkR for accumulo

2015-09-22 Thread madhvi.gupta
Hi Rui, Can't we use the Accumulo data RDD created from Java in Spark, in SparkR? Thanks and Regards Madhvi Gupta On Tuesday 22 September 2015 04:42 PM, Sun, Rui wrote: I am afraid that there is no support for Accumulo in SparkR now, because: 1. It seems that there is no data source support fo

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
Can checkpoints be stored to S3 (via an S3/S3A Hadoop URL)? Trying to answer this question, I looked into Checkpoint.getCheckpointFiles [1]. It is doing findFirstIn, which would probably be calling the S3 LIST operation. S3 LIST is prone to eventual consistency [2]. What would happen when getCheckpoin

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
Hi Zhiliang, I cannot find a specific doc. But as far as I remember, you can log in to one of your cluster machines, find the Hadoop configuration location, for example /etc/hadoop/conf, and copy that directory to your local machine. Typically it has hdfs-site.xml, yarn-site.xml etc. In spark, the f

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhiliang Zhu
Hi Zhan, Yes, I get it now. I have never deployed the Hadoop configuration locally and cannot find the specific doc; would you help provide the doc for doing that... Thank you, Zhiliang On Wednesday, September 23, 2015 11:08 AM, Zhan Zhang wrote: There is no difference between running th

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
There is no difference between running the client in or out of the cluster (assuming there is no firewall or network connectivity issue), as long as you have the Hadoop configuration locally. Here is the doc for running on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html Thanks. Zhan

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhiliang Zhu
Hi Zhan, Thanks very much for your helpful comment. I also thought it would be similar to Hadoop job submission; however, I was not sure whether it is like that when it comes to Spark. Have you ever tried that for Spark... Would you give me the deployment doc for the Hadoop and Spark gateway, since this i

Re: Streaming Receiver Imbalance Problem

2015-09-22 Thread SLiZn Liu
Cool, we are still sticking with 1.3.1, will upgrade to 1.5 ASAP. Thanks for the tips, Tathagata! On Wed, Sep 23, 2015 at 10:40 AM Tathagata Das wrote: > A lot of these imbalances were solved in spark 1.5. Could you give that a > spin? > > https://issues.apache.org/jira/browse/SPARK-8882 > > On

Re: Streaming Receiver Imbalance Problem

2015-09-22 Thread Tathagata Das
A lot of these imbalances were solved in spark 1.5. Could you give that a spin? https://issues.apache.org/jira/browse/SPARK-8882 On Tue, Sep 22, 2015 at 12:17 AM, SLiZn Liu wrote: > Hi spark users, > > In our Spark Streaming app via Kafka integration on Mesos, we initialed 3 > receivers to rece

Parallel collection in driver programs

2015-09-22 Thread Andy Huang
Hi All, I would like to know if anyone has experience with parallel collections in the driver program, and if there is an actual advantage/disadvantage to doing so. E.g. with a collection of JDBC connections and tables: we have adapted our non-Spark code, which utilizes parallel collections, to the Spark co
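
A minimal sketch of the pattern, assuming a JDBC source (URL and table names are placeholders); submitting jobs from multiple driver threads is supported, since the Spark scheduler is thread-safe:

  // load several tables concurrently from the driver using a Scala parallel collection
  val tables = Seq("orders", "customers", "products").par
  val dfs = tables.map { t =>
    sqlContext.read.format("jdbc")
      .options(Map("url" -> "jdbc:postgresql://host/db", "dbtable" -> t))
      .load()
  }.toList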

Re: Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Hi Sun, The issue with Ceph as the underlying file system for Spark is that you lose data locality. Ceph is not designed to have Spark run directly on top of the OSDs. I know that CephFS provides data location information via a Hadoop-compatible API. The last time I researched this, the in

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
It should be similar to other Hadoop jobs. You need the Hadoop configuration on your client machine, and point HADOOP_CONF_DIR in Spark to that configuration. Thanks Zhan Zhang On Sep 22, 2015, at 6:37 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote: Dear Experts, Spark job is run

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread tridib
By skewed, did you mean it's not distributed uniformly across partitions? All of my columns are strings and almost the same size, i.e. id1,field11,field12 id2,field21,field22 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-

Re: Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread fightf...@163.com
Hi Jerry, Yeah, we already manage to run and use Ceph in a few of our production environments, especially with OpenStack. The reason we want to use Ceph is that we are looking for a unified storage layer, and the design concepts of Ceph are quite appealing. I am just interested i

how to submit the spark job outside the cluster

2015-09-22 Thread Zhiliang Zhu
Dear Experts, The Spark job is running on the cluster via YARN. The job can be submitted on a machine from the cluster; however, I would like to submit the job from another machine which does not belong to the cluster. I know that for Hadoop this could be done by way of another ma

Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Do you have specific reasons to use Ceph? I used Ceph before and I'm not too in love with it, especially when I was using the Ceph Object Gateway S3 API; there are some incompatibilities with the AWS S3 API. You really, really need to try it before making the commitment. Did you manage to install it? On

Spark standalone/Mesos on top of Ceph

2015-09-22 Thread fightf...@163.com
Hi guys, Here is the info for Ceph: http://ceph.com/ We are investigating and using Ceph for distributed storage and monitoring, and are specifically interested in using Ceph as the underlying file system storage for Spark. However, we have no experience achieving that. Has anybody seen such pr

Re: WAL on S3

2015-09-22 Thread Tathagata Das
You can keep the checkpoints in the Hadoop-compatible file system and the WAL somewhere else using your custom WAL implementation. Yes, cleaning up the stuff gets complicated as it is not as easy as deleting off the checkpoint directory - you will have to clean up checkpoint directory as well as th

Yarn Shutting Down Spark Processing

2015-09-22 Thread Bryan Jeffrey
Hello. I have a Spark streaming job running on a cluster managed by Yarn. The spark streaming job starts and receives data from Kafka. It is processing well and then after several seconds I see the following error: 15/09/22 14:53:49 ERROR yarn.ApplicationMaster: SparkContext did not initialize

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
My understanding of pluggable WAL was that it eliminates the need for having a Hadoop-compatible file system [1]. What is the use of pluggable WAL when it can be only used together with checkpointing which still requires a Hadoop-compatible file system? [1]: https://issues.apache.org/jira/browse/

Re: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Richard Eggert
If there's only one partition, by definition it will only be handled by one executor. Repartition to divide the work up. Note that this will also result in multiple output files, however. If you absolutely need them to be combined into a single file, I suggest using the Unix/Linux 'cat' command t
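
A minimal sketch of that trade-off (input path, parallelism, and transformation are hypothetical):

  val data = sc.textFile("hdfs:///data/input")
  // spread the expensive work across executors first...
  val processed = data.repartition(48).map(_.toUpperCase)
  // ...then collapse only for the final write: one task writes the single part file
  processed.coalesce(1).saveAsTextFile("hdfs:///out")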

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Richard Eggert
Maybe it's just my phone, but I don't see any code. On Sep 22, 2015 11:46 AM, "juljoin" wrote: > Hello, > > I am trying to figure Spark out and I still have some problems with its > speed, I can't figure them out. In short, I wrote two programs that loop > through a 3.8Gb file and filter each li

Re: WAL on S3

2015-09-22 Thread Tathagata Das
1. Currently, the WAL can be used only with checkpointing turned on, because it does not make sense to recover from the WAL if there is no checkpoint information to recover from. 2. Since the current implementation saves the WAL in the checkpoint directory, they share their fate -- if the checkpoint direct
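
A minimal sketch of the two settings together (app name and paths are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("wal-example")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // turn on the WAL
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/wal-example")  // the WAL lives under this directory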

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
I am trying to use the pluggable WAL, but it can be used only with checkpointing turned on. Thus I still need to have a Hadoop-compatible file system. Is there something like pluggable checkpointing? Or can the WAL be used without checkpointing? What happens when the WAL is available but the checkpoint director

Re: HDP 2.3 support for Spark 1.5.x

2015-09-22 Thread Zhan Zhang
Hi Krishna, For the time being, you can download from upstream, and it should run OK on HDP 2.3. For HDP-specific problems, you can ask in the Hortonworks forum. Thanks. Zhan Zhang On Sep 22, 2015, at 3:42 PM, Krishna Sankar <ksanka...@gmail.com> wrote: Guys, * We have HDP 2.3

Re: Invalid checkpoint url

2015-09-22 Thread Tathagata Das
Are you getting this error in local mode? On Tue, Sep 22, 2015 at 7:34 AM, srungarapu vamsi wrote: > Yes, I tried ssc.checkpoint("checkpoint"), it works for me as long as i > don't use reduceByKeyAndWindow. > > When i start using "reduceByKeyAndWindow" it complains me with the error > "Exceptio

Re: Partitions on RDDs

2015-09-22 Thread Richard Eggert
In general, RDDs get partitioned automatically without programmer intervention. You generally don't need to worry about them unless you need to adjust the size/number of partitions or the partitioning scheme according to the needs of your application. Partitions get redistributed among nodes whene

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Jacek Laskowski
On Tue, Sep 22, 2015 at 10:03 PM, Ted Yu wrote: > To my knowledge, no one runs HBase on top of Mesos. Hi, That sentence caught my attention. Could you explain the reasons for not running HBase on Mesos, i.e. what makes Mesos inappropriate for HBase? Jacek -

SPARK_WORKER_INSTANCES was detected (set to '2')…This is deprecated in Spark 1.0+

2015-09-22 Thread Jacek Laskowski
Hi, This is for Spark 1.6.0-SNAPSHOT (SHA1 a96ba40f7ee1352288ea676d8844e1c8174202eb). I've been toying with Spark Standalone cluster and have the following file in conf/spark-env.sh: ➜ spark git:(master) ✗ cat conf/spark-env.sh SPARK_WORKER_CORES=2 SPARK_WORKER_MEMORY=2g # multiple Spark worke

HDP 2.3 support for Spark 1.5.x

2015-09-22 Thread Krishna Sankar
Guys, - We have HDP 2.3 installed just now. It comes with Spark 1.3.x. The current wisdom is that it will support the 1.4.x train (which is good, need DataFrame et al). - What is the plan to support Spark 1.5.x ? Can we install 1.5.0 on HDP 2.3 ? Or will Spark 1.5.x support be in HD

Partitions on RDDs

2015-09-22 Thread XIANDI
I'm always confused by partitions. We may have many RDDs in the code. Do we need to partition all of them? Do the RDDs get rearranged among all the nodes whenever we do a partitioning? What is a wise way of doing partitions? -- View this message in context: http://apache-spark-user-list.100

Re: How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Deenar Toraskar
Clement, In local mode all worker threads run in the driver VM. Your dictionary should not be copied 32 times; in fact it won't be broadcast at all. Have you tried increasing spark.driver.memory to ensure that the driver uses all the memory on the machine? Deenar On 22 September 2015 at 19:42, Clé
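
Driver memory has to be set before the driver JVM starts, so for local mode it is typically passed on the command line; a sketch with placeholder sizes and class name:

  spark-submit --master local[32] --driver-memory 48g --class com.example.App app.jar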

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Deenar Toraskar
Michael Thank you for your prompt answer. I will repost after I try this again on 1.5.1 or branch-1.5. In addition a blog post on SparkSQL data types would be very helpful. I am familiar with the Hive data types, but there is very little documentation on Spark SQL data types. Regards Deenar On 2

KafkaProducer using Cassandra as source

2015-09-22 Thread kali.tumm...@gmail.com
Hi All, I am a newbie in Spark. I managed to write a Kafka producer in Spark where the data comes from a Cassandra table, but I have a few questions. The Spark data output from Cassandra looks like below. [2,Joe,Smith] [1,Barack,Obama] I would like something like this in my kafka consumer, so need to rem

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Ted Yu
bq. it's relatively harder to use it with HBase I agree with Sean. I work on HBase. To my knowledge, no one runs HBase on top of Mesos. On Tue, Sep 22, 2015 at 12:31 PM, Sean Owen wrote: > Who told you Mesos would make Spark 100x faster? does it make sense > that just the resource manager could

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
Check out: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types On Tue, Sep 22, 2015 at 12:49 PM, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > Michael > > Thank you for your prompt answer. I will repost after I try this again on > 1.5.1 or branch-1.5. In ad

unsubscribe

2015-09-22 Thread Stuart Layton
-- Stuart Layton

Re: Deploying spark-streaming application on production

2015-09-22 Thread Adrian Tanase
BTW, I re-read the docs and I want to clarify that a reliable receiver + WAL gives you at-least-once, not exactly-once semantics. Sent from my iPhone On 21 Sep 2015, at 21:50, Adrian Tanase <atan...@adobe.com> wrote: I'm wondering, isn't this the canonical use case for WAL + reliable recei

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Sean Owen
Who told you Mesos would make Spark 100x faster? does it make sense that just the resource manager could make that kind of difference? This sounds entirely wrong, or, maybe a mishearing. I don't know if Mesos is somehow easier to use with Cassandra, but it's relatively harder to use it with HBase,

Spark as standalone or with Hadoop stack.

2015-09-22 Thread Shiv Kandavelu
Hi All, We currently have a Hadoop cluster with Yarn as the resource manager. We are planning to use HBase as the data store due to the C-P aspects of the CAP theorem. We now want to do extensive data processing, both on data stored in HBase as well as stream processing from an online website / API W

pyspark question: create RDD from csr_matrix

2015-09-22 Thread jeff saremi
I've tried desperately to create an RDD from a matrix I have. Every combination failed. I have a sparse matrix returned from a call to: dv = DictVectorizer(); sv_tf = dv.fit_transform(tf) which is supposed to be a matrix of document terms and their frequencies. I need to convert this to an

Re: How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Utkarsh Sengar
If the broadcast variable doesn't fit in memory, I think it is not the right fit for you. You can think about fitting it with an RDD as a tuple with the other data you are working on. Say you are working on an RDD (rdd in your case); run a map/reduce to convert it to an RDD of tuples, so now you have the relevant data from the

How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Clément Frison
Hello, My team and I have a 32-core machine and we would like to use a huge object - for example a large dictionary - in a map transformation, using all our cores in parallel by sharing this object among some tasks. We broadcast our large dictionary: dico_br = sc.broadcast(dico) We use it in a

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
I think that you are hitting a bug (which should be fixed in Spark 1.5.1). I'm hoping we can cut an RC for that this week. Until then you could try building branch-1.5. On Tue, Sep 22, 2015 at 11:13 AM, Deenar Toraskar wrote: > Hi > > I am trying to write an UDAF ArraySum, that does element wis

Spark 1.5 UDAF ArrayType

2015-09-22 Thread Deenar Toraskar
Hi, I am trying to write a UDAF, ArraySum, that does an element-wise sum of arrays of Doubles, returning an array of Doubles, following the sample in https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html. I am getting the following
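
For context, a rough sketch of what such a UDAF looks like with the Spark 1.5 UserDefinedAggregateFunction API (a fixed array length n is assumed and null handling is omitted); per the reply above, ArrayType buffers reportedly hit a bug in 1.5.0 that is fixed in 1.5.1:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
  import org.apache.spark.sql.types._

  class ArraySum(n: Int) extends UserDefinedAggregateFunction {
    def inputSchema: StructType = StructType(StructField("arr", ArrayType(DoubleType)) :: Nil)
    def bufferSchema: StructType = StructType(StructField("sum", ArrayType(DoubleType)) :: Nil)
    def dataType: DataType = ArrayType(DoubleType)
    def deterministic: Boolean = true
    def initialize(buffer: MutableAggregationBuffer): Unit =
      buffer(0) = Seq.fill(n)(0.0)                       // start from a zero vector
    def update(buffer: MutableAggregationBuffer, input: Row): Unit =
      buffer(0) = buffer.getAs[Seq[Double]](0)
        .zip(input.getAs[Seq[Double]](0)).map { case (a, b) => a + b }
    def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
      buffer1(0) = buffer1.getAs[Seq[Double]](0)
        .zip(buffer2.getAs[Seq[Double]](0)).map { case (a, b) => a + b }
    def evaluate(buffer: Row): Any = buffer.getAs[Seq[Double]](0)
  }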

Re: How to speed up MLlib LDA?

2015-09-22 Thread Marko Asplund
How optimized are the Commons math3 methods that showed up in profiling? Are there any higher performance alternatives to these? marko

Re: Has anyone used the Twitter API for location filtering?

2015-09-22 Thread Jo Sunad
Thanks Akhil, but I can't seem to get any tweets that include location data. For example, when I do stream.filter(status => status.getPlace().getName) and run the code for 20 minutes, I only get null values. It seems like Twitter might purposely be removing the Place for free users? On Tue, Sep 22

Re: How to speed up MLlib LDA?

2015-09-22 Thread Charles Earl
It seems that the Vowpal Wabbit version is most similar to what is in https://github.com/intel-analytics/TopicModeling/blob/master/src/main/scala/org/apache/spark/mllib/topicModeling/OnlineHDP.scala Although the Intel seems to implement the Hierarchical Dirichlet Process (topics and subtopics) as

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
Michael reminded me that although we don't support direct manipulation over Parquet metadata, you can still save/query metadata to/from Parquet via DataFrame per-column metadata. For example: import sqlContext.implicits._ import org.apache.spark.sql.types.MetadataBuilder val path = "file:///tm
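
Since the example above is cut off, a rough reconstruction of the idea (the column name and metadata value are invented, assuming an existing DataFrame df with an id column):

  import org.apache.spark.sql.types.MetadataBuilder

  val md = new MetadataBuilder().putString("comment", "trade identifier").build()
  val withMeta = df.select(df("id").as("id", md))  // metadata travels with the column
  withMeta.write.parquet("file:///tmp/with-metadata")

  val reloaded = sqlContext.read.parquet("file:///tmp/with-metadata")
  reloaded.schema("id").metadata                   // round-trips through Parquet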

Re: Count for select not matching count for group by

2015-09-22 Thread Michael Armbrust
This looks like something is wrong with predicate pushdown. Can you include the output of calling explain, and tell us what format the data is stored in? On Mon, Sep 21, 2015 at 8:06 AM, Michael Kelly wrote: > Hi, > > I'm seeing some strange behaviour with spark 1.5, I have a dataframe > that I

Re: How to speed up MLlib LDA?

2015-09-22 Thread Pedro Rodriguez
I helped some with the LDA and worked quite a bit on a Gibbs version. I don't know if the Gibbs version might help, but since it is not (yet) in MLlib, Intel Analytics kindly created a spark package with their adapted version plus a couple other LDA algorithms: http://spark-packages.org/package/int

Re: Creating BlockMatrix with java API

2015-09-22 Thread Pulasthi Supun Wickramasinghe
Hi Yanbo, Thanks for the reply. I thought I might be missing something. Anyway, I moved to using Scala since it is the complete API. Best Regards, Pulasthi On Tue, Sep 22, 2015 at 7:03 AM, Yanbo Liang wrote: > This is due to the distributed matrices like > BlockMatrix/RowMatrix/IndexedRowMatri

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-22 Thread Luciano Resende
localDF is a pure R data frame and as.vector will work with no problems. As for calling it on the SparkR objects, try calling collect before you call as.vector (or in your case, the algorithms); that should solve your problem. On Mon, Sep 21, 2015 at 8:48 AM, Ellen Kraffmiller < ellen.kraffmil...@

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
I see, this makes sense. We should probably add this in Spark SQL. However, there's one corner case to note about user-defined Parquet metadata. When committing a write job, ParquetOutputCommitter writes Parquet summary files (_metadata and _common_metadata), and user-defined key-value metadat

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
The indices are definitely necessary. My first solution was just reduceByKey { case (v, _) => v } and that didn't work. I needed to look at both values and see which had the lower index. On Tue, Sep 22, 2015 at 8:54 AM, Sean Owen wrote: > The point is that this only works if you already knew the
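A minimal sketch of that zipWithIndex approach (input path and key extraction are hypothetical):

  // keep, per key, the value that appears first in the file,
  // assuming zipWithIndex reflects the original line order
  val lines = sc.textFile("hdfs:///data/input")
  val firstPerKey = lines.zipWithIndex()
    .map { case (line, idx) => (line.split(",")(0), (line, idx)) }
    .reduceByKey { (a, b) => if (a._2 <= b._2) a else b }  // keep the lower index
    .mapValues(_._1)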

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Saisai Shao
I think you're using the wrong version of the Kafka assembly jar. I think the Python API for the direct Kafka stream is not supported for Spark 1.3.0; you'd better change to version 1.5.0. It looks like you're using Spark 1.5.0, so why did you choose Kafka assembly 1.3.0? D:\sw\spark-streaming-kafka-assembly_2.10-1.3

Re: Using Spark for portfolio manager app

2015-09-22 Thread Thúy Hằng Lê
That's a great answer, Adrian. I found a lot of information here. I have a direction for the application now; I will try your suggestion :) On Tuesday, September 22, 2015, Adrian Tanase wrote: > >1. reading from kafka has exactly once guarantees - we are using it in >production today (wi

Re: Spark Web UI + NGINX

2015-09-22 Thread Ruslan Dautkhanov
It should be really simple to set up. Check this Hue + NGINX setup page: http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/ In that config file, change 1) > server_name NGINX_HOSTNAME; to "Machine A, with a public IP" 2) > server HUE_HOST1: max_fails=3; > server HUE_HOST2: max_fails=3; to
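
Adapted from that Hue recipe, a hypothetical minimal server block for fronting the Spark UI (hostnames and ports invented):

  upstream spark_ui { server machine-b:8080 max_fails=3; }
  server {
    listen 80;
    server_name spark.example.com;  # Machine A, the public IP
    location / { proxy_pass http://spark_ui; }
  }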

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
The point is that this only works if you already knew the file was presented in order within and across partitions, which was the original problem anyway. I don't think it is in general, but in practice, I do imagine it's already in the expected order from textFile. Maybe under the hood this ends u

Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread juljoin
Hello, I am trying to figure Spark out and I still have some problems with its speed, which I can't figure out. In short, I wrote two programs that loop through a 3.8 GB file and filter each line depending on whether a certain word is present. I wrote a one-thread Python program doing the job and I obt

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
I have used the mapPartitionsWithIndex/zipWithIndex solution and so far it has done the correct thing. On Tue, Sep 22, 2015 at 8:38 AM, Adrian Tanase wrote: > just give zipWithIndex a shot, use it early in the pipeline. I think it > provides exactly the info you need, as the index is the origina

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Adrian Tanase
Just give zipWithIndex a shot; use it early in the pipeline. I think it provides exactly the info you need, as the index is the original line number in the file, not the index within the partition. Sent from my iPhone On 22 Sep 2015, at 17:50, Philip Weaver <philip.wea...@gmail.com> wrote:

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
I don't know of a way to do this, out of the box, without maybe digging into custom InputFormats. The RDD from textFile doesn't have an ordering. I can't imagine a world in which partitions weren't iterated in line order, of course, but there's also no real guarantee about ordering among partitions

Re: Help getting started with Kafka

2015-09-22 Thread Yana Kadiyska
Thanks a lot Cody! I was punting on the decoders by calling count (or trying to, since my types require a custom decoder), but your sample code is exactly what I was trying to achieve. The error message threw me off; will work on the decoders now. On Tue, Sep 22, 2015 at 10:50 AM, Cody Koeninger wr

RE: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread java8964
Your performance problem sounds like it is in the driver, which is trying to broadcast 10k files all by itself, and becomes the bottleneck. What you want is just to transfer the data from Avro format per file to another format. In MR, most likely each mapper processes one file, and you utilized the w

Re: Help getting started with Kafka

2015-09-22 Thread Cody Koeninger
You need type parameters for the call to createRDD indicating the type of the key / value and the decoder to use for each. See https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/BasicRDD.scala Also, you need to check to see if offsets 0 through 100 are still actua
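
A minimal sketch of the call Cody describes, with String keys and values (topic, broker, and offsets are placeholders); as the reply notes, it fails if offsets 0-100 have already aged out of Kafka's retention:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  val ranges = Array(OffsetRange("mytopic", 0, 0L, 100L))  // partition 0, offsets [0, 100)
  // type parameters: key, value, key decoder, value decoder
  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, ranges)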

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
Thanks. If textFile can be used in a way that preserves order, then both the partition index and the index within each partition should be consistent, right? I overcomplicated the question by asking about removing duplicates. Fundamentally, I think my question is: how does one sort lines in a file

Error while saving parquet

2015-09-22 Thread gtinside
Please refer to the code snippet below. I get the following error: */tmp/temp/trade.parquet/part-r-00036.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [20, -28, -93, 93] at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418) at org

Help getting started with Kafka

2015-09-22 Thread Yana Kadiyska
Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka queue into HDFS. Being very new to Kafka, I'm not sure if I'm messing something up on that side... My hope is to read the messages presently in the queue (or at least the first 100 for now). Here is what I have on the Kafka side: ./bin/

Re: Invalid checkpoint url

2015-09-22 Thread srungarapu vamsi
Yes, I tried ssc.checkpoint("checkpoint"); it works for me as long as I don't use reduceByKeyAndWindow. When I start using reduceByKeyAndWindow it complains with the error "Exception in thread "main" org.apache.spark.SparkException: Invalid checkpoint directory: file:/home/ubuntu/checkpoint/3

RE: Performance Spark SQL vs Dataframe API faster

2015-09-22 Thread Cheng, Hao
Yes, it should be the same, as they are just different frontends to the same thing in optimization / execution. -Original Message- From: sanderg [mailto:s.gee...@wimionline.be] Sent: Tuesday, September 22, 2015 10:06 PM To: user@spark.apache.org Subject: Performance Spark SQL vs Dataframe

spark on mesos gets killed by cgroups for too much memory

2015-09-22 Thread oggie
I'm using Spark 1.2.2 on Mesos 0.21. I have a Java job that is submitted to Mesos from Marathon. I also have cgroups configured for Mesos on each node. Even though the job, when running, uses 512 MB, it tries to take over 3 GB at startup and is killed by cgroups. When I start mesos-slave, it's sta

Performance Spark SQL vs Dataframe API faster

2015-09-22 Thread sanderg
Is there a difference in performance between writing a Spark job using only SQL statements and writing it using the DataFrame API, or does it translate to the same thing under the hood? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-Spark-SQL-vs-

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
I agree, but it's a constraint I have to deal with. The idea is to load these files and merge them into ORC. When using Hive on Tez it takes less than a minute. Daniel > On Sep 22, 2015, Jonathan Coveney wrote: > > having a file per record is pretty inefficient on almost any file system

Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
Hi, I have added the Spark assembly jar to SPARK_CLASSPATH >>> print os.environ['SPARK_CLASSPATH'] D:\sw\spark-streaming-kafka-assembly_2.10-1.3.0.jar Now I am facing the below issue with a test topic >>> ssc = StreamingContext(sc, 2) >>> kvs = KafkaUtils.createDirectStream(ssc,['spark'],{"metadata.br
