Each of the records has a sequence id, with no duplicates.
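
For reference, a minimal sketch of a gap check that stays distributed, using
the same Spark 1.x DataFrame API as the code further down in this thread
(sqlContext.range needs Spark 1.4+; the "path" input and ids starting at 1
are assumptions):

    SparkConf conf = new SparkConf().setAppName("gap-finder");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // The ids arrive as strings in the JSON, so cast them to longs.
    DataFrame ids = sqlContext.read().json("path")
            .select(functions.col("id").cast("long").as("id"));

    long maxId = ids.agg(functions.max("id")).first().getLong(0);

    // range(start, end) generates [start, end) as a single "id" column;
    // except() keeps the ids of that range that were never observed.
    DataFrame missing = sqlContext.range(1, maxId + 1).except(ids);
    missing.show();

No driver-side counter is needed, and nothing beyond the gaps themselves is
ever collected.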

On Mon, Sep 19, 2016 at 11:42 AM, ayan guha <guha.a...@gmail.com> wrote:

> And how do you define a missing sequence? Can you give an example?
>
> On Mon, Sep 19, 2016 at 3:48 PM, Sudhindra Magadi <smag...@gmail.com>
> wrote:
>
>> Hi Jörn,
>>  We have a file with a billion records. We want to find whether there are
>> any missing sequences in it and, if so, what they are.
>> Thanks
>> Sudhindra
>>
>> On Mon, Sep 19, 2016 at 11:12 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> I am not sure what you are trying to achieve here. Can you please tell us
>>> what the goal of the program is, maybe with some example data?
>>>
>>> Besides this, I have the feeling that it will fail once it is used beyond
>>> a single-node scenario, due to the reference to the global counter
>>> variable: each executor JVM gets its own copy of that static field, so
>>> updates to it are never shared.
>>>
>>> It is also unclear why you collect the data to the driver first, only to
>>> parallelize it again.
>>>
>>> On 18 Sep 2016, at 14:26, sudhindra <smag...@gmail.com> wrote:
>>>
>>> Hi, I have coded something like this; please tell me how bad it is.
>>>
>>> package Spark.spark;
>>>
>>> import java.util.List;
>>>
>>> import org.apache.spark.SparkConf;
>>> import org.apache.spark.api.java.JavaRDD;
>>> import org.apache.spark.api.java.JavaSparkContext;
>>> import org.apache.spark.api.java.function.Function;
>>> import org.apache.spark.sql.DataFrame;
>>> import org.apache.spark.sql.Row;
>>> import org.apache.spark.sql.SQLContext;
>>>
>>> public class App
>>> {
>>>     // Shared mutable state: this only works while everything runs in a
>>>     // single JVM (master = local[*]); on a cluster each executor would
>>>     // get its own copy of this static field.
>>>     static long counter = 1;
>>>
>>>     public static void main(String[] args)
>>>     {
>>>         SparkConf conf = new SparkConf()
>>>                 .setAppName("sorter")
>>>                 .setMaster("local[2]")
>>>                 .set("spark.executor.memory", "1g");
>>>         JavaSparkContext sc = new JavaSparkContext(conf);
>>>         SQLContext sqlContext = new SQLContext(sc);
>>>
>>>         DataFrame df = sqlContext.read().json("path");
>>>         DataFrame sortedDF = df.sort("id");
>>>
>>>         // Pulls every row to the driver, then redistributes them.
>>>         JavaRDD<Row> distData = sc.parallelize(sortedDF.collectAsList());
>>>
>>>         List<String> missingNumbers = distData.map(new Function<Row, String>() {
>>>
>>>             public String call(Row row) throws Exception {
>>>                 int id = Integer.parseInt(row.getString(0));
>>>                 if (counter != id)
>>>                 {
>>>                     // Record every id between the expected value and the
>>>                     // one actually seen.
>>>                     StringBuilder misses = new StringBuilder();
>>>                     for (long missing = counter; missing != id; missing++)
>>>                     {
>>>                         misses.append(missing).append(' ');
>>>                     }
>>>                     counter = id + 1;
>>>                     return misses.toString();
>>>                 }
>>>                 counter++;
>>>                 return null;
>>>             }
>>>         }).collect();
>>>
>>>         for (String gaps : missingNumbers) {
>>>             if (gaps != null) {
>>>                 System.out.println(gaps);
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filling-missing-values-in-a-sequence-tp5708p27748.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards
>> Sudhindra S Magadi
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Thanks & Regards
Sudhindra S Magadi
