RE: Question about RDD cache, unpersist, materialization

2014-06-11 Thread Nick Pentreath
If you want to force materialization, use .count(). Also, if you can, simply don't unpersist anything unless you really need to free the memory. On Wed, Jun 11, 2014 at 5:13 AM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: BTW, it is possible that rdd.first()
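A minimal sketch of the advice above (Scala, in spark-shell where sc is available; the path is a placeholder):

    // cache() only marks the RDD for caching; nothing is computed yet
    val rdd = sc.textFile("hdfs:///some/path").cache()
    // an action such as count() forces the RDD to be computed and materialized
    rdd.count()
    // later actions on `rdd` now read from the cache instead of recomputing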

Re: Spark Streaming not processing file with particular number of entries

2014-06-11 Thread praveshjain1991
Well, I was able to get it to work by running Spark over Mesos. But it looks like a bug when running Spark alone.

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-11 Thread elyast
Hi, I'm facing a similar problem. According to http://tachyon-project.org/Running-Spark-on-Tachyon.html, in order to allow the Tachyon client to connect to the Tachyon master in HA mode, you need to pass two system properties: -Dtachyon.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181
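In Spark 1.0, the usual SPARK_JAVA_OPTS alternative for passing such system properties is the extraJavaOptions settings in conf/spark-defaults.conf; a sketch for the property shown above (the second Tachyon property is listed on the linked page):

    spark.executor.extraJavaOptions -Dtachyon.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181
    spark.driver.extraJavaOptions   -Dtachyon.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181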

Re: Problem in Spark Streaming

2014-06-11 Thread nilmish
I used these flags to show the GC timings: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps Following is the output I got on the standard output: 4.092: [GC 4.092: [ParNew: 274752K->27199K(309056K), 0.0421460 secs] 274752K->27199K(995776K), 0.0422720 secs] [Times: user=0.33 sys=0.11,

Re: Problem in Spark Streaming

2014-06-11 Thread vinay Bajaj
http://stackoverflow.com/questions/895444/java-garbage-collection-log-messages http://stackoverflow.com/questions/16794783/how-to-read-a-verbosegc-output I think this will help in understanding the logs. On Wed, Jun 11, 2014 at 12:53 PM, nilmish nilmish@gmail.com wrote: I used these

Number of Spark streams in Yarn cluster

2014-06-11 Thread tnegi
Hi, I am trying to get a sense of the number of streams we can process in parallel on a Spark Streaming cluster (Hadoop YARN). Is there any benchmark for this? We need a large number of streams (original + transformed) to be processed in parallel. The number is approximately 30,

Re: Spark Kafka streaming - ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2014-06-11 Thread gaurav.dasgupta
Thanks Tobias for replying. The problem was that I had to provide the dependency jars' paths to the StreamingContext within the code. Providing all the jar paths resolved my problem. Refer to the code snippet below: *JavaStreamingContext ssc = new JavaStreamingContext(args[0],
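The Scala StreamingContext has an equivalent constructor that accepts the dependency jars; a sketch, with the master URL, app name, and jar paths as placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      "spark://master:7077",          // master URL (placeholder)
      "KafkaConsumer",                // app name (placeholder)
      Seconds(2),                     // batch interval
      System.getenv("SPARK_HOME"),
      Seq("target/my-app.jar", "spark-streaming-kafka_2.10-1.0.0.jar"))  // dependency jars (hypothetical paths)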

Re: HDFS Server/Client IPC version mismatch while trying to access HDFS files using Spark-0.9.1

2014-06-11 Thread bijoy deb
Any suggestions from anyone? Thanks Bijoy On Tue, Jun 10, 2014 at 11:46 PM, bijoy deb bijoy.comput...@gmail.com wrote: Hi all, I have built Shark-0.9.1 with sbt using the command below: *SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.6.0 sbt/sbt assembly* My Hadoop cluster also has version

Re: Hanging Spark jobs

2014-06-11 Thread Daniel Darabos
Do these stack traces come from the stuck node? It looks like it's waiting on data in BlockFetcherIterator, i.e. waiting for data from another node. But you say all other nodes were done? Very curious. Maybe you could try turning on debug logging and figure out what happens in BlockFetcherIterator (

Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Pei-Lun Lee
Hi, I am using Spark 1.0.0 and found that in Spark SQL some queries using GROUP BY give weird results. To reproduce, type the following commands in a spark-shell connected to a standalone server: case class Foo(k: String, v: Int) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import
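The truncated snippet presumably continued along these lines (a sketch, not the poster's exact commands; the sample data, table name, and query are assumptions):

    case class Foo(k: String, v: Int)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD
    val foos = sc.parallelize(List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)))
    foos.registerAsTable("foo")
    sqlContext.sql("SELECT k, SUM(v) FROM foo GROUP BY k").collect()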

Normalizations in MLBase

2014-06-11 Thread Aslan Bekirov
Hi All, I have to normalize a set of values in the range 0-500 to the [0,1] range. Is there any util method in MLBase to normalize a large set of data? BR, Aslan

RE: Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Cheng, Hao
That’s a good catch, but I think it’s suggested to use HiveContext currently. ( https://github.com/apache/spark/tree/master/sql) Catalyst$ sbt/sbt hive/console case class Foo(k: String, v: Int) val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++ List.fill(300)(Foo("c", 3))

Re: Information on Spark UI

2014-06-11 Thread Daniel Darabos
About more succeeded tasks than total tasks: - This can happen if you have enabled speculative execution. Some partitions can get processed multiple times. - More commonly, the result of the stage may be used in a later calculation, and has to be recalculated. This happens if some of the results

Re: Error During ReceivingConnection

2014-06-11 Thread Surendranauth Hiraman
It looks like this was due to another executor on a different node closing the connection on its side. I found the entries below in the remote side's logs. Can anyone comment on why one ConnectionManager would close its connection to another node and what could be tuned to avoid this? It did not

Re: Information on Spark UI

2014-06-11 Thread Shuo Xiang
Daniel, Thanks for the explanation. On Wed, Jun 11, 2014 at 8:57 AM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: About more succeeded tasks than total tasks: - This can happen if you have enabled speculative execution. Some partitions can get processed multiple times. - More

Re: Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Michael Armbrust
I'd try rerunning with master. It is likely you are running into SPARK-1994 https://issues.apache.org/jira/browse/SPARK-1994. Michael On Wed, Jun 11, 2014 at 3:01 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am using spark 1.0.0 and found in spark sql some queries use GROUP BY give weird

Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
Hi, I am currently using Spark 1.0 locally on Windows 7. I would like to use classes from an external jar in the spark-shell. I followed the instructions in: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCALrNVjWWF6k=c_jrhoe9w_qaacjld4+kbduhfv0pitr8h1f...@mail.gmail.com%3E I

Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Marcelo Vanzin
Ah, not that it should matter, but I'm on Linux and you seem to be on Windows... maybe there is something weird going on with the Windows launcher? On Wed, Jun 11, 2014 at 10:34 AM, Marcelo Vanzin van...@cloudera.com wrote: Just tried this and it worked fine for me: ./bin/spark-shell --jars

Re: pmml with augustus

2014-06-11 Thread Villu Ruusmann
Hello Spark/PMML enthusiasts, It's pretty trivial to integrate the JPMML-Evaluator library with Spark. In brief, take the following steps in your Spark application code: 1) Create a Java Map (arguments) that represents the input data record. You need to specify a key-value mapping for every
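A rough sketch of this first step in Scala, assuming an org.jpmml.evaluator.Evaluator has already been built from the PMML document (the record layout is hypothetical):

    import scala.collection.JavaConverters._
    import org.dmg.pmml.FieldName
    import org.jpmml.evaluator.{Evaluator, FieldValue}

    // record: field name -> raw value, e.g. Map("x1" -> 1.0, "x2" -> 2.0)
    def score(evaluator: Evaluator, record: Map[String, Any]) = {
      val arguments = new java.util.LinkedHashMap[FieldName, FieldValue]
      for (field <- evaluator.getActiveFields.asScala) {
        // prepare() converts and validates the raw value for this input field
        arguments.put(field, evaluator.prepare(field, record(field.getValue)))
      }
      evaluator.evaluate(arguments)  // map of result fields, incl. the target
    }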

RE: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
Are you able to import any class from your jars within spark-shell? -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Wednesday, June 11, 2014 9:36 PM To: user@spark.apache.org Subject: Re: Adding external jar to spark-shell classpath in spark 1.0 Ah, not that it

RE: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
Could you elaborate on this? I don’t have an application, I just use the spark-shell. From: Andrew Or [mailto:and...@databricks.com] Sent: Wednesday, June 11, 2014 9:40 PM To: user@spark.apache.org Subject: Re: Adding external jar to spark-shell classpath in spark 1.0 This is a known issue:

Re: HDFS Server/Client IPC version mismatch while trying to access HDFS files using Spark-0.9.1

2014-06-11 Thread Marcelo Vanzin
The error is saying that your client libraries are older than what your server is using (2.0.0-mr1-cdh4.6.0 is IPC version 7). Try double-checking that your build is actually using that version (e.g., by looking at the hadoop jar files in lib_managed/jars). On Wed, Jun 11, 2014 at 2:07 AM, bijoy

Re: Information on Spark UI

2014-06-11 Thread Shuo Xiang
Using MEMORY_AND_DISK_SER to persist the input RDD[Rating] seems to work right for me now. I'm testing on a larger dataset and will see how it goes. On Wed, Jun 11, 2014 at 9:56 AM, Neville Li neville@gmail.com wrote: Does cache eviction affect disk storage level too? I tried cranking up

Powered by Spark addition

2014-06-11 Thread Derek Mansen
Hello, I was wondering if we could add our organization to the Powered by Spark page. The information is: Name: Vistar Media URL: www.vistarmedia.com Description: Location technology company enabling brands to reach on-the-go consumers. Let me know if you need anything else. Thanks! Derek

Re: Having trouble with streaming (updateStateByKey)

2014-06-11 Thread Michael Campbell
I rearranged my code to do a reduceByKey, which I think is working. I also don't think the problem was that updateState call but something else; unfortunately I changed a lot while looking for this issue, so I'm not sure what the actual fix might have been, but I think it's working now. On Wed, Jun

Kafka client - specify offsets?

2014-06-11 Thread Michael Campbell
Is there a way in the Apache Spark Kafka Utils to specify an offset to start reading from? Specifically, from the start of the queue, or failing that, a specific point?
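The high-level consumer used by KafkaUtils.createStream does not take explicit offsets; a sketch of the common workaround, setting auto.offset.reset so a fresh consumer group starts from the beginning of the log (hosts, group, and topic names are placeholders):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils
    import kafka.serializer.StringDecoder

    val kafkaParams = Map(
      "zookeeper.connect" -> "zkHost:2181",
      "group.id" -> "my-consumer-group",   // use a new group to re-read from the start
      "auto.offset.reset" -> "smallest")
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)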

Re: Normalizations in MLBase

2014-06-11 Thread DB Tsai
Hi Aslan, Currently we don't have a utility function to do so. However, you can easily implement this with another map transformation. I'm working on this feature now, and there will be a couple of different normalization options users can choose from. Sincerely, DB Tsai
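A minimal sketch of the map-transformation approach for an RDD[Double] (plain scaling for the known 0-500 range, plus a min-max variant when the bounds aren't known in advance):

    // known bounds: scale by the maximum
    val normalized = values.map(_ / 500.0)

    // unknown bounds: compute min/max first, then min-max scale
    val lo = values.reduce(math.min)
    val hi = values.reduce(math.max)
    val scaled = values.map(v => (v - lo) / (hi - lo))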

Re: json parsing with json4s

2014-06-11 Thread Michael Cutler
Hello, You're absolutely right: the syntax you're using returns the json4s value objects, not native types like Int, Long, etc. Fix that problem and then everything else (filters) will work as you expect. This is a short snippet of a larger example: [1] val lines =
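A short sketch of the fix being described: bring implicit formats into scope and use extract[...] to get native types instead of json4s AST values (the field name is hypothetical):

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    val ages = lines.map { line =>
      implicit val formats = DefaultFormats   // needed by extract[...]
      (parse(line) \ "age").extract[Int]      // a native Int, not a JValue
    }.filter(_ > 21)                          // numeric filters now behave as expected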

Re: Powered by Spark addition

2014-06-11 Thread Matei Zaharia
Alright, added you. Matei On Jun 11, 2014, at 1:28 PM, Derek Mansen de...@vistarmedia.com wrote: Hello, I was wondering if we could add our organization to the Powered by Spark page. The information is: Name: Vistar Media URL: www.vistarmedia.com Description: Location technology company

Re: Not fully cached when there is enough memory

2014-06-11 Thread Xiangrui Meng
Could you try clicking on that RDD and looking at the storage info per partition? I tried continuously caching RDDs, so new ones kick old ones out when there is not enough memory. I saw similar glitches, but the storage info per partition is correct. If you find a way to reproduce this error, please

Re: Using Spark on Data size larger than Memory size

2014-06-11 Thread Allen Chang
Thanks. We've run into timeout issues at scale as well. We were able to work around them by setting the following JVM options: -Dspark.akka.askTimeout=300 -Dspark.akka.timeout=300 -Dspark.worker.timeout=300 NOTE: these JVM options *must* be set on the worker nodes (and not just the driver/master) for
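A sketch of one way to apply these on every worker, via conf/spark-env.sh (SPARK_JAVA_OPTS is deprecated in Spark 1.0 but still honored):

    # conf/spark-env.sh on each worker node
    SPARK_JAVA_OPTS="-Dspark.akka.askTimeout=300 -Dspark.akka.timeout=300 -Dspark.worker.timeout=300"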

Re: problem starting the history server on EC2

2014-06-11 Thread zhen
I tried everything, including sudo, but it still did not work using the local directory. However, I finally got it working by having the history server log to HDFS. I first created a directory in HDFS like the following: ./ephemeral-hdfs/bin/hadoop fs -mkdir /spark_logs Then I started the
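For reference, a sketch of the sequence (the namenode host and port are placeholders; in Spark 1.0 the history server takes the log directory as its first argument):

    ./ephemeral-hdfs/bin/hadoop fs -mkdir /spark_logs
    ./sbin/start-history-server.sh hdfs://namenode:9000/spark_logs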

Re: Using Spark to crack passwords

2014-06-11 Thread Marek Wiewiorka
What about rainbow tables? http://en.wikipedia.org/wiki/Rainbow_table M. 2014-06-12 2:41 GMT+02:00 DB Tsai dbt...@stanford.edu: I think creating the samples in the search space within RDD will be too expensive, and the amount of data will probably be larger than any cluster. However, you

Re: Not fully cached when there is enough memory

2014-06-11 Thread Shuo Xiang
Xiangrui, clicking into the RDD link gives the same message, saying only 96 of 100 partitions are cached. The disk/memory usage is the same, which is far below the limit. Is this what you wanted me to check, or something else? On Wed, Jun 11, 2014 at 4:38 PM, Xiangrui Meng men...@gmail.com wrote:

Re: When to use CombineByKey vs reduceByKey?

2014-06-11 Thread Matei Zaharia
combineByKey is designed for when your return type from the aggregation is different from the values being aggregated (e.g. you group together objects), and it should allow you to modify the leftmost argument of each function (mergeCombiners, mergeValue, etc) and return that instead of
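A small sketch of the pattern being described: Int values aggregated into a different type (a buffer), with mergeValue and mergeCombiners mutating and returning their leftmost argument instead of allocating new objects:

    import scala.collection.mutable.ArrayBuffer

    // pairs: RDD[(K, Int)]
    val grouped = pairs.combineByKey[ArrayBuffer[Int]](
      (v: Int) => ArrayBuffer(v),                                         // createCombiner
      (buf: ArrayBuffer[Int], v: Int) => { buf += v; buf },               // mergeValue: mutate leftmost arg
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => { b1 ++= b2; b1 })  // mergeCombiners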

Re: Using Spark to crack passwords

2014-06-11 Thread Nicholas Chammas
Yes, I mean the RDD would just have elements to define partitions or ranges within the search space, not actual hashes. It's really just using the RDD as a control structure rather than as a real data set. As you noted, we don't need to store any hashes. We just need to check them as they
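A toy sketch of the control-structure idea: the RDD carries only slice indices, and each task enumerates and checks its share of the candidate space on the fly (candidatesFor and md5 are hypothetical helpers):

    val target = "5f4dcc3b5aa765d61d8327deb882cf99"   // example target hash
    val numSlices = 10000

    val hit = sc.parallelize(0 until numSlices, numSlices)
      .flatMap(slice => candidatesFor(slice))         // lazily generate this slice's candidates
      .filter(candidate => md5(candidate) == target)  // hash and compare; nothing is stored
      .take(1)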

use spark-shell in the source

2014-06-11 Thread JaeBoo Jung
Hi all, Can I use spark-shell programmatically in my Spark application (in Java or Scala)? I want to convert Scala lines to a string array and run them automatically in my application. For example: for (line <- lines) { // run this line in spark