Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Ryan Compton
FWIW, if you do decide to handle language detection on your machine, this library works great on tweets: https://github.com/carrotsearch/langid-java On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote: But

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-06 Thread Ryan Compton
Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
/bidirectional-network-current/part-r-1' USING PigStorage() AS (id1:long, id2:long, weight:int); ttt = LIMIT edgeList0 10; DUMP ttt; On Wed, May 28, 2014 at 12:55 PM, Ryan Compton compton.r...@gmail.com wrote: It appears to be Spark 1.0 related. I made a pom.xml with a single dependency on Spark

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
Posted a JIRA: https://issues.apache.org/jira/browse/SPARK-1952 On Wed, May 28, 2014 at 1:14 PM, Ryan Compton compton.r...@gmail.com wrote: Remark, just including the jar built by sbt will produce the same error, i.e., this Pig script will fail: REGISTER /usr/share/osi1/spark-1.0.0/assembly

Spark 1.0: slf4j version conflicts with pig

2014-05-27 Thread Ryan Compton
I use both Pig and Spark. All my code is built with Maven into a giant *-jar-with-dependencies.jar. I recently upgraded to Spark 1.0 and now all my Pig scripts fail with: Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job:

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ryan Compton
, Ryan Compton compton.r...@gmail.com wrote: I'm trying to shoehorn a label propagation-ish algorithm into GraphX. I need to update each vertex with the median value of its neighbors. Unlike PageRank, which updates each vertex with the mean of its neighbors, I don't have a simple commutative

GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
I am trying to read an edge list into a Graph. My data looks like 394365859 -- 136153151 589404147 -- 1361045425 I read it into a Graph via: val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName) val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t")) .map(x

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt This code: println(g.numEdges) println(g.numVertices) println(g.edges.distinct().count()) gave me 1 9294 2 On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote: I wasn't able to reproduce this

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B rows, I'm guessing about 500M are distinct) my jobs fail with: 14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") and use ext3 (CentOS...) On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton compton.r...@gmail.com wrote: Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B

All pairs shortest paths?

2014-03-26 Thread Ryan Compton
No idea how feasible this is. Has anyone done it?

Re: All pairs shortest paths?

2014-03-26 Thread Ryan Compton
To clarify: I don't need the actual paths, just the distances. On Wed, Mar 26, 2014 at 3:04 PM, Ryan Compton compton.r...@gmail.com wrote: No idea how feasible this is. Has anyone done it?