Re: Using Neo4j with Apache Spark

2015-03-12 Thread Marcin Cylke
On Thu, 12 Mar 2015 00:48:12 -0700 d34th4ck3r gautam1...@gmail.com wrote: I'm trying to use Neo4j with Apache Spark Streaming but I am finding serializability to be an issue. Basically, I want Apache Spark to parse and bundle my data in real time. After the data has been bundled it should be

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Tathagata Das
What is the GraphDatabaseService object that you are using? Instead of creating it on the driver (outside foreachRDD), can you create it inside RDD.foreach? In general, the right pattern for doing this is in the programming guide

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Gautam Bajaj
I tried that too. It results in the same serializability issue. The GraphDatabaseService that I'm using is GraphDatabaseFactory(): http://neo4j.com/api_docs/2.0.0/org/neo4j/graphdb/factory/GraphDatabaseFactory.html On Thu, Mar 12, 2015 at 5:21 PM, Tathagata Das t...@databricks.com wrote: What is

Re: Can't cache RDD of collaborative filtering on MLlib

2015-03-12 Thread Yuichiro Sakamoto
I got an answer from a mail posted to the ML. --- Summary --- cache() is lazy, so you can call `RDD.count()` explicitly to load the RDD into memory. --- I tried it, the two RDDs were cached, and the speed improved. Thank you.

connecting spark application with SAP hana

2015-03-12 Thread Hafiz Mujadid
Hi experts! Is there any way to connect to SAP HANA from a Spark application and read data from HANA tables? Thanks

Compilation error

2015-03-12 Thread Mohit Anchlia
I am trying out the streaming example as documented and I am using Spark 1.2.1 streaming from Maven for Java. When I add this code I get a compilation error and Eclipse is not able to recognize Tuple2. I also don't see any scala.Tuple2 class to import.

Using Neo4j with Apache Spark

2015-03-12 Thread d34th4ck3r
I'm trying to use Neo4j with Apache Spark Streaming but I am finding serializability to be an issue. Basically, I want Apache Spark to parse and bundle my data in real time. After the data has been bundled it should be stored in the database, Neo4j. However, I am getting this error:

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-12 Thread Shivaram Venkataraman
There are a couple of differences between the ml-matrix implementation and the one used in AMPCamp - I think the AMPCamp one uses JBLAS which tends to ship native BLAS libraries along with it. In ml-matrix we switched to using Breeze + Netlib BLAS which is faster but needs some setup [1] to pick

Re: connecting spark application with SAP hana

2015-03-12 Thread Akhil Das
SAP HANA can be integrated with Hadoop (http://saphanatutorial.com/sap-hana-and-hadoop/), so you will be able to read/write to it using Spark's newAPIHadoopFile API by passing the correct Configuration, etc. Thanks Best Regards On Thu, Mar 12, 2015 at 1:15 PM, Hafiz Mujadid

Re: Compilation error

2015-03-12 Thread Sean Owen
A couple points: You've got mismatched versions here -- 1.2.0 vs 1.2.1. You should fix that but it's not your problem. These are also supposed to be 'provided' scope dependencies in Maven. You should get the Scala deps transitively and can import scala.* classes. However, it would be a little

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Tathagata Das
I am not sure if you realized, but the code snippet is pretty mangled up in the email we received. It might be a good idea to put the code in a pastebin or gist, which is much, much easier for everyone to read. On Thu, Mar 12, 2015 at 12:48 AM, d34th4ck3r gautam1...@gmail.com wrote: I'm trying to use Neo4j

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Gautam Bajaj
Alright, I have also asked this question on StackOverflow: http://stackoverflow.com/questions/28896898/using-neo4j-with-apache-spark The code there is pretty neat. On Thu, Mar 12, 2015 at 4:55 PM, Tathagata Das t...@databricks.com wrote: I am not sure if you realized but the code snippet is

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Tathagata Das
Well, the answers you got there are correct as well. Unfortunately I am not familiar enough with Neo4j to comment further. Is the Neo4j graph database running externally (outside the Spark cluster), within the driver process, or on all the executors? Can you clarify that? TD On Thu, Mar 12, 2015

Re: Joining data using Latitude, Longitude

2015-03-12 Thread Andrew Musselman
Ted Dunning and Ellen Friedman's Time Series Databases has a section on this with some approaches to geo-encoding: https://www.mapr.com/time-series-databases-new-ways-store-and-access-data http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf On Tue, Mar 10, 2015 at 3:53 PM, John Meehan

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Gautam Bajaj
Neo4j is running externally. It has nothing to do with the Spark processes. Basically, the problem is that I'm unable to figure out a way to store the output of Spark in the database, as Spark Streaming requires the Neo4j Core Java API to be serializable as well. The answer points to using the REST API but

Re: Unable to saveToCassandra while cassandraTable works fine

2015-03-12 Thread Gerard Maas
This: java.lang.NoSuchMethodError almost always indicates a version conflict somewhere. It looks like you are using Spark 1.1.1 with the cassandra-spark connector 1.2.0. Try aligning those. Those metrics were introduced recently in the 1.2.0 branch of the cassandra connector. Either upgrade your

Re: Compilation error

2015-03-12 Thread Mohit Anchlia
It works after sync, thanks for the pointers On Tue, Mar 10, 2015 at 1:22 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I navigated to maven dependency and found scala library. I also found Tuple2.class and when I click on it in eclipse I get invalid LOC header (bad signature)

RE: sc.textFile() on windows cannot access UNC path

2015-03-12 Thread Wang, Ningjun (LNG-NPV)
Thanks for the reference. Is the following procedure correct? 1. Copy the Hadoop source class org.apache.hadoop.mapreduce.lib.input.TextInputFormat.java as my own class, e.g. UncTextInputFormat.java 2. Modify UncTextInputFormat.java to handle UNC paths 3. Call

RE: hbase sql query

2015-03-12 Thread Udbhav Agarwal
Thanks Akhil. Additionally, if we want to do a SQL query we need to create a JavaPairRDD, then a JavaRDD, then a JavaSchemaRDD, and then call sqlContext.sql(sql query). Right? Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 12 March, 2015 11:43 AM To: Udbhav Agarwal Cc:

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-12 Thread Marius Soutier
Thanks, the new guide did help - instantiating the SQLContext inside foreachRDD did the trick for me, but the SQLContext singleton works as well. Now the only problem left is that spark.driver.port is not retained after starting from a checkpoint, so my Actor receivers are running on a random

RE: SQL with Spark Streaming

2015-03-12 Thread Huang, Jie
@Tobias, According to my understanding, your approach is to register a series of tables by using transformWith, right? And then, you can get a new Dstream (i.e., SchemaDstream), which consists of lots of SchemaRDDs. Please correct me if my understanding is wrong. Thank you Best Regards,

Re: S3 SubFolder Write Issues

2015-03-12 Thread Harshvardhan Chauhan
I use s3n://BucketName/SomeFolder/OutputFolder and it works for my app. On Wed, Mar 11, 2015 at 12:14 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Does it write anything in BUCKET/SUB_FOLDER/output? Thanks Best Regards On Wed, Mar 11, 2015 at 10:15 AM, cpalm3 cpa...@gmail.com wrote:

Re: Unable to read files In Yarn Mode of Spark Streaming ?

2015-03-12 Thread Prannoy
Streaming takes only new files into consideration. Add the file after starting the job. On Thu, Mar 12, 2015 at 2:26 PM, CH.KMVPRASAD [via Apache Spark User List] ml-node+s1001560n2201...@n3.nabble.com wrote: Yes! For testing purposes I defined a single file in the specified directory

RE: hbase sql query

2015-03-12 Thread Udbhav Agarwal
Thanks Todd, but this link is also based on Scala; I was looking for some help with the Java APIs. Thanks, Udbhav Agarwal From: Todd Nist [mailto:tsind...@gmail.com] Sent: 12 March, 2015 5:28 PM To: Udbhav Agarwal Cc: Akhil Das; user@spark.apache.org Subject: Re: hbase sql query Have you

Re: hbase sql query

2015-03-12 Thread Todd Nist
Ah, missed that Java was a requirement. What distribution of Hadoop are you using? Here is an example that may help, along with a few links to the JavaHbaseContext and a basic example. https://github.com/tmalaska/SparkOnHBase

Re: Read parquet folders recursively

2015-03-12 Thread Akhil Das
With fileStream you are free to plug in any InputFormat; in your case, you can easily plug in ParquetInputFormat. Here are some Parquet Hadoop examples: https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example . Thanks Best Regards On Thu, Mar 12, 2015 at

spark sql performance

2015-03-12 Thread Udbhav Agarwal
Hi, What is the query time for a join query on HBase with Spark SQL? Say the tables in HBase have 0.5 million records each. I am expecting a query time (latency) in milliseconds with Spark SQL. Is this possible? Thanks, Udbhav Agarwal

Re: Timed out while stopping the job generator plus subsequent failures

2015-03-12 Thread Sean Owen
I don't think that's the same issue I was seeing, but you can have a look at https://issues.apache.org/jira/browse/SPARK-4545 for more detail on my issue. On Thu, Mar 12, 2015 at 12:51 AM, Tobias Pfeiffer t...@preferred.jp wrote: Sean, On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer

Re: Building Spark 1.3 for Scala 2.11 using Maven

2015-03-12 Thread Fernando O.
Just FYI: what @Marcelo said fixed the issue for me. On Fri, Mar 6, 2015 at 7:11 AM, Sean Owen so...@cloudera.com wrote: -Pscala-2.11 and -Dscala-2.11 will happen to do the same thing for this profile. Why are you running install package and not just install? Probably doesn't matter. This

RE: hbase sql query

2015-03-12 Thread Udbhav Agarwal
Thanks Todd, but this particular code is missing the JavaHbaseContext.java file; the code only has JavaHbaseContext.scala. Moreover, I specifically wanted to perform join queries on tables with millions of records and expect millisecond latency. For doing a SQL query of that type I guess we need

Re: Unable to stop Worker in standalone mode by sbin/stop-all.sh

2015-03-12 Thread Ted Yu
Interesting. Short term, maybe create the following file with pid 24922 ? /tmp/spark-taoewang-org.apache.spark.deploy.worker.Worker-1.pid Cheers On Thu, Mar 12, 2015 at 6:51 PM, sequoiadb mailing-list-r...@sequoiadb.com wrote: Nope, I can see the master file exist but not the worker: $ ls

RE: Handling worker batch processing during driver shutdown

2015-03-12 Thread Jose Fernandez
Thanks for the reply! Theoretically I should be able to do as you suggest as I follow the pool design pattern from the documentation, but I don’t seem to be able to run any code after .stop() is called. override def main(args: Array[String]) { // setup val ssc = new

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Tathagata Das
Well, that's why I had also suggested using a pool of the GraphDBService objects :) Also present in the programming guide link I had given. TD On Thu, Mar 12, 2015 at 7:38 PM, Gautam Bajaj gautam1...@gmail.com wrote: Thanks a ton! That worked. However, this may have performance issue. As

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Gautam Bajaj
Thanks a ton! That worked. However, this may have a performance issue: for each partition I'd need to restart the server, which was the basic reason I was creating the graphDb object outside this loop. On Fri, Mar 13, 2015 at 5:34 AM, Tathagata Das t...@databricks.com wrote: (Putting user@spark

Re: Support for skewed joins in Spark

2015-03-12 Thread Shixiong Zhu
I sent a PR to add skewed join last year: https://github.com/apache/spark/pull/3505 However, it does not split a key across multiple partitions. Instead, if a key has too many values to fit in memory, it will store the values on disk temporarily and use the disk files to do the join.

Re: Handling worker batch processing during driver shutdown

2015-03-12 Thread Tathagata Das
What version of Spark are you using? You may be hitting a known but since-fixed bug where the receivers would not get the stop signal, and (stopGracefully = true) would wait indefinitely for the receivers to stop. Try setting stopGracefully to false and see if it works. This bug should have been

Re: spark sql writing in avro

2015-03-12 Thread Kevin Peng
Dale, I basically have the same Maven dependency as above, but my code will not compile because it cannot reference AvroSaver, though the saveAsAvro reference compiles fine, which is weird. Even though saveAsAvro compiles for me, it errors out when running the Spark job due to it not being

Re: Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Thanks Shixiong, I'll try out your PR. Do you know what the status of the PR is? Are there any plans to incorporate this change to the DataFrames/SchemaRDDs in Spark 1.3? Soila On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu zsxw...@gmail.com wrote: I sent a PR to add skewed join last year:

run spark standalone mode

2015-03-12 Thread Grandl Robert
Hi guys, I have a stupid question, but I am not sure how to get out of it. I deployed Spark 1.2.1 on a cluster of 30 nodes. Looking at master:8088 I can see all the workers I have created so far. (I start the cluster with sbin/start-all.sh.) However, when running a Spark SQL query or even

SPARKQL Join partitioner

2015-03-12 Thread gtanguy
Hello, I am wondering how does join work in Spark SQL? Does it co-partition the two tables, or does it do it by wide dependency? I have two big tables to join; the query creates more than 150 GB of temporary data, so the query stops because I have no space left on my disk. I guess I could use a

Re: run spark standalone mode

2015-03-12 Thread Grandl Robert
Sorry guys for this. It seems that I need to start the Thrift server with the --master spark://ms0220:7077 option, and now I can see applications running in my web UI. Thanks, Robert On Thursday, March 12, 2015 10:57 AM, Grandl Robert rgra...@yahoo.com.INVALID wrote: I figured out for

How to consider HTML files in Spark

2015-03-12 Thread yh18190
Hi. I am very much fascinated by the Spark framework. I am trying to use PySpark + BeautifulSoup to parse HTML files. I am facing problems loading HTML files into BeautifulSoup. Example: filepath= file:///path to html directory def readhtml(inputhtml): { soup=Beautifulsoup(inputhtml) //to load html

Jackson-core-asl conflict with Spark

2015-03-12 Thread Uthayan Suthakar
Hello Guys, I'm running into the below error: Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass I have created an uber jar with jackson-core-asl 1.9.13 and passed it with the --jars configuration, but I am still getting errors. I searched on the net and found a

Re: Jackson-core-asl conflict with Spark

2015-03-12 Thread Ted Yu
Looking at the dependency tree: [INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile ... [INFO] | +- org.codehaus.jackson:jackson-core-asl:jar:1.9.2:compile In the root pom.xml: <dependency> <groupId>org.codehaus.jackson</groupId> <artifactId>jackson-core-asl</artifactId>

Re: Writing to a single file from multiple executors

2015-03-12 Thread Tathagata Das
If you use DStream.saveAsHadoopFiles (or equivalent RDD ops) with the appropriate output format (for Avro) then each partition of the RDDs will be written to a different file. However there is probably going to be a large number of small files and you may have to run a separate compaction phase to

Re: How to consider HTML files in Spark

2015-03-12 Thread Davies Liu
sc.wholeTextFiles() is what you need. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles On Thu, Mar 12, 2015 at 9:26 AM, yh18190 yh18...@gmail.com wrote: Hi. I am very much fascinated by the Spark framework. I am trying to use PySpark + BeautifulSoup to

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-12 Thread Arun Luthra
I'm using a pre-built Spark; I'm not trying to compile Spark. The compile error appears when I try to register HighlyCompressedMapStatus in my program: kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus]) If I don't register it, I get a runtime error saying that it needs

KafkaUtils and specifying a specific partition

2015-03-12 Thread ColinMc
Hi, How do you use KafkaUtils to specify a specific partition? I'm writing per-customer Marathon jobs where each customer is given one partition of a Kafka topic. The job will get the partition for that customer from our database and use that to get the messages for that customer. I misinterpreted

Re: Jackson-core-asl conflict with Spark

2015-03-12 Thread Sean Owen
What version of Hadoop are you using, and are you setting the right Hadoop profile? Because they already set the Jackson version to 1.9.13 IIRC, so maybe that's the issue. On Thu, Mar 12, 2015 at 5:15 PM, Ted Yu yuzhih...@gmail.com wrote: Looking at dependency tree: [INFO] +-

Re: Writing to a single file from multiple executors

2015-03-12 Thread Maiti, Samya
Hi TD, I want to append my records to an Avro file which will later be used for querying. Having a single file is not mandatory for us, but then how can we make the executors append the Avro data to multiple files? Thanks, Sam On Mar 12, 2015, at 4:09 AM, Tathagata Das

Re: Jackson-core-asl conflict with Spark

2015-03-12 Thread Paul Brown
So... one solution would be to use a non-Jurassic version of Jackson. 2.6 will drop before too long, and 3.0 is in longer-term planning. The 1.x series is long deprecated. If you're genuinely stuck with something ancient, then you need to include the JAR that contains the class, and 1.9.13 does

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-12 Thread Imran Rashid
Giving a bit more detail on the error would make it a lot easier for others to help you out. E.g., in this case, it would have helped if you included your actual compile error. In any case, I'm assuming your issue is because that class is private to Spark. You can sneak around that by using

Re: KafkaUtils and specifying a specific partition

2015-03-12 Thread Colin McQueen
Thanks! :) Colin McQueen *Software Developer* On Thu, Mar 12, 2015 at 3:05 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Colin, my understanding is that this is currently not possible with KafkaUtils. You would have to write a custom receiver using Kafka's SimpleConsumer API.

Re: run spark standalone mode

2015-03-12 Thread Grandl Robert
I figured it out for spark-shell by passing the --master option. However, I am still troubleshooting launching SQL queries. My current command is like this: ./bin/beeline -u jdbc:hive2://ms0220:1 -n `whoami` -p ignored -f tpch_query10.sql On Thursday, March 12, 2015 10:37 AM, Grandl

Re: KafkaUtils and specifying a specific partition

2015-03-12 Thread Cody Koeninger
KafkaUtils.createDirectStream, added in spark 1.3, will let you specify a particular topic and partition On Thu, Mar 12, 2015 at 1:07 PM, Colin McQueen colin.mcqu...@shiftenergy.com wrote: Thanks! :) Colin McQueen *Software Developer* On Thu, Mar 12, 2015 at 3:05 PM, Jeffrey Jedele

Re: Handling worker batch processing during driver shutdown

2015-03-12 Thread Tathagata Das
Can you access the batcher directly? Like, is there a handle to get access to the batchers on the executors by running a task on that executor? If so, after the StreamingContext has been stopped (not the SparkContext), then you can use `sc.makeRDD()` to run a dummy task like this.

Re: Jackson-core-asl conflict with Spark

2015-03-12 Thread Uthayan Suthakar
Thank you all, I tried this -Dcodehaus.jackson.version=1.9.13 as suggested by Ted Yu and it resolved my issue. Many thanks. Cheers, Uthay. On 12 March 2015 at 17:22, Ted Yu yuzhih...@gmail.com wrote: Uthay: You can run mvn dependency:tree command to find out the actual jackson-core-asl

Re: Efficient Top count in each window

2015-03-12 Thread Tathagata Das
Why are you repartitioning to 1 partition? That would obviously be slow; you are converting a distributed operation into a single-node operation. Also consider using RDD.top(). If you define the ordering right (based on the count), then you will get the top K without doing a shuffle for sortByKey. Much

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread Daniel Siegmann
Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network. Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs

AWS SDK HttpClient version conflict (spark.files.userClassPathFirst not working)

2015-03-12 Thread Adam Lewandowski
I'm trying to use the AWS SDK (v1.9.23) to connect to DynamoDB from within a Spark application. Spark 1.2.1 is assembled with HttpClient 4.2.6, but the AWS SDK depends on HttpClient 4.3.4 for its communication with DynamoDB. The end result is an error when the app tries to connect to

Re: PairRDD serialization exception

2015-03-12 Thread dylanhockey
I have the exact same error. I am running a PySpark job in yarn-client mode. It works well in standalone but I need to run it in yarn-client mode. Other people reported the same problem when bundling jars and extra dependencies. I'm pointing PySpark to use a specific Python executable bundled

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
BTW, I was running tests from SBT when I got the errors. One test turns a Seq of case classes into a DataFrame. I also tried running similar code in the console, but it failed with the same error. I tested both Spark 1.3.0-rc2 and 1.2.1 with Scala 2.11.6 and 2.10.4. Any idea? Jianshi On Fri, Mar 13, 2015 at

Re: KafkaUtils and specifying a specific partition

2015-03-12 Thread Jeffrey Jedele
Hi Colin, my understanding is that this is currently not possible with KafkaUtils. You would have to write a custom receiver using Kafka's SimpleConsumer API. https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html

Efficient Top count in each window

2015-03-12 Thread Laeeq Ahmed
Hi, I have a streaming application where I am doing a top-10 count in each window, which seems slow. Is there a more efficient way to do this? val counts = keyAndValues.map(x => math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4)) val topCounts =

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-12 Thread Arun Luthra
The error is in the original post. Here's the recipe that worked for me: kryo.register(Class.forName("org.roaringbitmap.RoaringArray$Element")) kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]]) kryo.register(classOf[Array[Short]])

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
Same issue here. But the classloader in my exception is somehow different. scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with java.net.URLClassLoader@53298398 of type class java.net.URLClassLoader with classpath Jianshi On Sun, Mar 1, 2015 at

Re: can not submit job to spark in windows

2015-03-12 Thread Arush Kharbanda
Seems the path is not set correctly. It's looking for C:\\bin\winutils.exe. You would need to set the path correctly. On Thu, Feb 26, 2015 at 7:59 PM, sergunok ser...@gmail.com wrote: Hi! I downloaded and extracted Spark to a local folder under Windows 7 and have successfully played with it in

Re: KafkaUtils and specifying a specific partition

2015-03-12 Thread mykidong
If you want to use another Kafka receiver instead of the current Spark Kafka receiver, you can see this: https://github.com/mykidong/spark-kafka-simple-consumer-receiver/blob/master/src/main/java/spark/streaming/receiver/kafka/KafkaReceiverUtils.java You can handle getting just the stream from the

Logistic Regression displays ERRORs

2015-03-12 Thread cjwang
I am running LogisticRegressionWithLBFGS. I got these lines on my console: 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.5 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-12 Thread Jaonary Rabarisoa
In fact, by activating netlib with native libraries it goes faster. Thanks On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are a couple of differences between the ml-matrix implementation and the one used in AMPCamp - I think the AMPCamp one

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-12 Thread Shivaram Venkataraman
On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: In fact, by activating netlib with native libraries it goes faster. Glad you got it to work! Better performance was one of the reasons we made the switch. Thanks On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman

Re: Is there a limit to the number of RDDs in a Spark context?

2015-03-12 Thread Juan Rodríguez Hortalá
Hi, It's been some time since my last message on the subject of using many RDDs in a Spark job, but I have just encountered the same problem again. The thing is that I have an RDD of time-tagged data that I want to 1) divide into windows according to a timestamp field; 2) compute KMeans for

Re: Read parquet folders recursively

2015-03-12 Thread Masf
Hi. Thanks for your answers, but to read Parquet files is it necessary to use the parquetFile method in org.apache.spark.sql.SQLContext? How can I combine your solution with the call to this method? Thanks!! Regards On Thu, Mar 12, 2015 at 8:34 AM, Yijie Shen henry.yijies...@gmail.com

Re: hbase sql query

2015-03-12 Thread Todd Nist
Have you considered using the spark-hbase-connector for this: https://github.com/nerdammer/spark-hbase-connector On Thu, Mar 12, 2015 at 5:19 AM, Udbhav Agarwal udbhav.agar...@syncoms.com wrote: Thanks Akhil. Additionaly if we want to do sql query we need to create JavaPairRdd, then

Error running rdd.first on hadoop

2015-03-12 Thread Lau, Kawing (GE Global Research)
Hi, I was running with spark-1.3.0-snapshot: rdd = sc.textFile("hdfs://X.X.X.X/data") rdd.first() Then I got this error: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/pyspark/rdd.py", line 1243, in first rs = self.take(1) File "/pyspark/rdd.py", line 1195, in take

spark sql writing in avro

2015-03-12 Thread kpeng1
Hi All, I am currently trying to write out a schema RDD to Avro. I noticed that there is a databricks spark-avro library and I have included that in my dependencies, but it looks like I am not able to access the AvroSaver object. On compilation of the job I get this: error: not found: value

Limit # of parallel parquet decompresses

2015-03-12 Thread ankits
My jobs frequently run out of memory if the number of cores on an executor is too high, because each core launches a new Parquet decompressor thread, which allocates memory off-heap to decompress. Consequently, even with, say, 12 cores on an executor, depending on the memory, I can only use 2-3 to avoid

repartitionAndSortWithinPartitions and mapPartitions and sort order

2015-03-12 Thread Darin McBeath
I am using repartitionAndSortWithinPartitions to partition my content and then sort within each partition. I've also created a custom partitioner that I use with repartitionAndSortWithinPartitions. I created a custom partitioner as my key consists of something like 'groupid|timestamp' and I

Re: AWS SDK HttpClient version conflict (spark.files.userClassPathFirst not working)

2015-03-12 Thread 浦野 裕也
Hi Adam, Could you try building Spark with the -Pkinesis-asl profile? mvn -Pkinesis-asl -DskipTests clean package This refers to the 'Running the Example' section: https://spark.apache.org/docs/latest/streaming-kinesis-integration.html In fact, I've seen the same issue and have been able to use the AWS SDK by

Re: NegativeArraySizeException when doing joins on skewed data

2015-03-12 Thread Soila Pertet Kavulya
Hi Tristan, Did upgrading to Kryo3 help? Thanks, Soila On Sun, Mar 1, 2015 at 2:48 PM, Tristan Blakers tris...@blackfrog.org wrote: Yeah I implemented the same solution. It seems to kick in around the 4B mark, but looking at the log I suspect it’s probably a function of the number of unique

Unable to stop Worker in standalone mode by sbin/stop-all.sh

2015-03-12 Thread sequoiadb
Checking the script, it seems spark-daemon.sh is unable to stop the worker: $ ./spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1 no org.apache.spark.deploy.worker.Worker to stop $ ps -elf | grep spark 0 S taoewang 24922 1 0 80 0 - 733878 futex_ Mar12 ? 00:08:54 java -cp

Re: Unable to stop Worker in standalone mode by sbin/stop-all.sh

2015-03-12 Thread Ted Yu
Does the machine have cron job that periodically cleans up /tmp dir ? Cheers On Thu, Mar 12, 2015 at 6:18 PM, sequoiadb mailing-list-r...@sequoiadb.com wrote: Checking the script, it seems spark-daemon.sh unable to stop the worker $ ./spark-daemon.sh stop org.apache.spark.deploy.worker.Worker

Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Does Spark support skewed joins similar to Pig which distributes large keys over multiple partitions? I tried using the RangePartitioner but I am still experiencing failures because some keys are too large to fit in a single partition. I cannot use broadcast variables to work-around this because

Re: spark sql writing in avro

2015-03-12 Thread M. Dale
Short answer: if you downloaded spark-avro from the repo.maven.apache.org repo you might be using an old version (pre-November 14, 2014) - see timestamps at http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/ Lots of changes at https://github.com/databricks/spark-avro since

Re: Unable to stop Worker in standalone mode by sbin/stop-all.sh

2015-03-12 Thread sequoiadb
Nope, I can see that the master file exists but not the worker's: $ ls bitrock_installer.log hsperfdata_root hsperfdata_taoewang omatmp sbt2435921113715137753.log spark-taoewang-org.apache.spark.deploy.master.Master-1.pid On Mar 13, 2015, at 9:34 AM, Ted Yu yuzhih...@gmail.com wrote: Does the machine have cron

Re: sc.textFile() on windows cannot access UNC path

2015-03-12 Thread Akhil Das
Sounds like the way to do it. Could you try accessing a file from a UNC path with native Java NIO code and make sure it is able to access it with the URI file:10.196.119.230/folder1/abc.txt? Thanks Best Regards On Wed, Mar 11, 2015 at 7:45 PM, Wang, Ningjun (LNG-NPV)

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-12 Thread raghav0110.cs
Thank you very much for your detailed response, it was very informative and cleared up some of my misconceptions. After your explanation, I understand that the distribution of the data and parallelism is all meant to be an abstraction to the developer. In your response you say “When you

Re: hbase sql query

2015-03-12 Thread Akhil Das
Like this? val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]).cache() Here's a complete example

Re: bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available

2015-03-12 Thread Akhil Das
Spark 1.3.0 is not officially out yet, so I don't think sbt will download the Hadoop dependencies for your Spark by itself. You could try manually adding the Hadoop dependencies yourself (hadoop-core, hadoop-common, hadoop-client). Thanks Best Regards On Wed, Mar 11, 2015 at 9:07 PM, Patcharee

Re: StreamingListener

2015-03-12 Thread Akhil Das
At the end of foreachRDD, I believe. Thanks Best Regards On Thu, Mar 12, 2015 at 6:48 AM, Corey Nolet cjno...@gmail.com wrote: Given the following scenario: dstream.map(...).filter(...).window(...).foreachrdd() When would the onBatchCompleted fire?

Re: Read parquet folders recursively

2015-03-12 Thread Akhil Das
Hi, we have a custom build to read directories recursively. Currently we use it with fileStream like: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/datadumps/", (t: Path) => true, true, true), making the 4th argument true to read recursively. You could give it a try

Re: Unable to read files In Yarn Mode of Spark Streaming ?

2015-03-12 Thread Prannoy
Are the files already present in HDFS before you start your application? On Thu, Mar 12, 2015 at 11:11 AM, CH.KMVPRASAD [via Apache Spark User List] ml-node+s1001560n22008...@n3.nabble.com wrote: Hi, I successfully executed the SparkPi example in yarn mode but I am not able to read files

Re: Define exception handling on lazy elements?

2015-03-12 Thread Sean Owen
Handling exceptions this way means handling errors on the driver side, which may or may not be what you want. You can also write functions with exception handling inside, which could make more sense in some cases (like, to ignore bad records or count them or something). If you want to handle

Re: Read parquet folders recursively

2015-03-12 Thread Yijie Shen
org.apache.spark.deploy.SparkHadoopUtil has a method: /** Get [[FileStatus]] objects for all leaf children (files) under the given base path. If the given path points to a file, return a single-element collection containing [[FileStatus]] of that file. */ def